Discriminant analysis with errors in variables 

Sebastien Loustau* and Clement Marteau^ 



Abstract 

The effect of measurement error in discriminant analysis is investigated. Given observations $Z = X + \epsilon$, where $\epsilon$ denotes a random noise, the goal is to predict the density of $X$ among two possible candidates $f$ and $g$. We suppose that we have at our disposal two learning samples. The aim is to approach the best possible decision rule $G^*$ defined as a minimizer of the Bayes risk.

In the free-noise case ($\epsilon = 0$), minimax fast rates of convergence are well-known under the margin assumption in discriminant analysis (see [23]) or in the more general classification framework (see [30]). In this paper we intend to establish similar results in the noisy case, i.e. when dealing with errors in variables. In particular, we discuss two possible complexity assumptions that can be set on the problem, which may alternatively concern the regularity of $f - g$ or the boundary of $G^*$. We prove minimax lower bounds for both problems and explain how these rates can be attained, using in particular Empirical Risk Minimizer (ERM) methods based on deconvolution kernel estimators.



1 Introduction

In the problem of discriminant analysis, we usually observe two i.i.d. samples $X_1^{(1)}, \dots, X_n^{(1)}$ and $X_1^{(2)}, \dots, X_m^{(2)}$. Each observation $X_j^{(i)} \in \mathbb{R}^d$ is assumed to admit a density with respect to a $\sigma$-finite measure $Q$, dominated by the Lebesgue measure. This density will be denoted by $f$ if the observation belongs to the first set (i.e. when $i = 1$) or $g$ in the other case. Our aim is to infer the density of a new incoming observation $X$. This problem can be considered as a particular case of the more general and extensively studied binary classification problem (see [12] for a detailed introduction or [7] for a concise survey).

In this framework, a decision rule or classifier can be identified with a set $G \subset \mathbb{R}^d$, which attributes $X$ to $f$ if $X \in G$ and to $g$ otherwise. Then, we can associate to each classifier $G$ its corresponding Bayes risk $R_K(G)$ defined as

$$R_K(G) = \frac{1}{2}\left[\int_{K\setminus G} f(x)\,dQ(x) + \int_{G} g(x)\,dQ(x)\right], \qquad (1.1)$$

where we restrict the problem to a compact set $K \subset \mathbb{R}^d$. The minimizer of the Bayes risk (the best possible classifier for this criterion) is given by

$$G^*_K = \{x \in K : f(x) \geq g(x)\}, \qquad (1.2)$$

where the minimization is taken over all subsets of $K$. The Bayes classifier is obviously unknown since it explicitly depends on the couple $(f, g)$. The goal is thus to estimate $G^*_K$ thanks to a classifier $\hat G_{n,m}$ based on the two learning samples.

In this paper we propose to estimate the Bayes classifier $G^*_K$ defined in (1.2) when dealing with noisy samples. For all $i \in \{1,2\}$, we assume that we observe

$$Z_j^{(i)} = X_j^{(i)} + \epsilon_j^{(i)}, \quad j = 1, \dots, n_i, \qquad (1.3)$$

instead of the $X_j^{(i)}$ (with $n_1 = n$ and $n_2 = m$). The $\epsilon_j^{(i)}$ denote random variables expressing measurement errors. We will see in this paper that we are faced with an inverse problem, and more precisely with a deconvolution problem. Indeed, if $\epsilon$ admits a density $\eta$ with respect to the Lebesgue measure, then the corresponding density of the $Z_j^{(i)}$ is the convolution product $(f \cdot \mu) * \eta$ if $i = 1$ or $(g \cdot \mu) * \eta$ if $i = 2$, provided that $dQ(x) = \mu(x)dx$ for some bounded function $\mu$. This gives rise to a deconvolution step in the estimation procedure. Deconvolution problems arise in many fields where data are obtained with measurement errors and are at the core of several nonparametric statistical studies. For a general review of the possible methodologies associated with these problems we may mention for instance [26]. More specifically, we refer to [H] in density estimation or [8] where goodness-of-fit tests are constructed in the presence of noise. The key to all these studies is to construct a deconvolution kernel which allows one to annihilate the noise $\epsilon$. More details on the construction of such objects are provided in Section 3. It is important to note that in this discriminant analysis setup, or more generally in classification, there is to our knowledge no such work. The aim of this paper is to describe minimax rates of convergence in noisy discriminant analysis under the margin assumption.



In the free-noise case, i.e. when $\epsilon = 0$, [23] has drawn attention to minimax fast rates of convergence (i.e. faster than $n^{-1/2}$) and states in particular

$$\inf_{\hat G} \sup_{G^*_K \in \mathcal{G}(\alpha,\rho)} \mathbb{E}\left[R_K(\hat G) - R_K(G^*_K)\right] \asymp n^{-\frac{\alpha+1}{2+\alpha+\rho\alpha}}, \quad \text{as } n \to +\infty, \qquad (1.4)$$

where $\mathcal{G}(\alpha,\rho)$ is a nonparametric set of candidates $G^*_K$ with complexity $\rho > 0$ and margin parameter $\alpha \geq 0$ (see Section 2.1 for a precise definition). In (1.4), the complexity parameter $\rho > 0$ is related to the notion of entropy with bracketing whereas the margin is used to relate the variance to the expectation. It allows [23] to get improved bounds using the so-called peeling technique of [16]. This result is at the origin of a recent and vast literature on fast rates of convergence in classification (see for instance [25, 12]) or in general statistical learning (see [19]). In these papers, the complexity assumption can be of two forms: a geometric assumption over the class of candidates $G^*_K$ (such as finite VC dimension, or boundary fragments) or assumptions on the regularity of the regression function of classification (plug-in type assumptions). In [25], minimax fast rates are stated for finite VC classes of candidates whereas plug-in type assumptions have been studied in classification in [2] (see also [12, 28]). More generally, [19] proposes to consider $\rho > 0$ as a complexity parameter in local Rademacher complexities and gives general upper bounds generalizing (1.4) and the results of [23] and [2].

In all these results, empirical risk minimizers appear as good candidates to reach these fast rates of convergence. Indeed, given a class of candidates $\mathcal{G}$, a natural way to estimate $G^*_K$ is to consider an Empirical Risk Minimization (ERM) approach. In standard discriminant analysis (e.g. in the free-noise case considered in [23]), the risk $R_K(G)$ in (1.1) can be estimated by

$$R_{n,m}(G) = \frac{1}{2n}\sum_{j=1}^{n} \mathbf{1}_{\{X_j^{(1)} \in G^c\}} + \frac{1}{2m}\sum_{j=1}^{m} \mathbf{1}_{\{X_j^{(2)} \in G\}}, \qquad (1.5)$$

leading to an empirical risk minimizer $\hat G_{n,m}$, if it exists, defined as:

$$\hat G_{n,m} = \arg\min_{G \in \mathcal{G}} R_{n,m}(G). \qquad (1.6)$$

Unfortunately, in the errors-in-variables model, since we observe noisy samples $Z = X + \epsilon$, the probability densities of the observed variables w.r.t. the Lebesgue measure are respectively the convolutions $(f \cdot \mu) * \eta$ and $(g \cdot \mu) * \eta$. As a result, the classical ERM principle fails since:

$$\frac{1}{2n}\sum_{i=1}^{n} \mathbf{1}_{\{Z_i^{(1)} \in G^c\}} + \frac{1}{2m}\sum_{i=1}^{m} \mathbf{1}_{\{Z_i^{(2)} \in G\}} \xrightarrow[n,m \to \infty]{} \frac{1}{2}\left[\int_{K\setminus G} (f \cdot \mu) * \eta(x)\,dx + \int_{G} (g \cdot \mu) * \eta(x)\,dx\right] \neq R_K(G).$$

As a consequence, we propose to add a deconvolution step in the classical ERM procedure by considering the solution of the minimization:

$$\min_{G \in \mathcal{G}} R^{\lambda}_{n,m}(G),$$

where $R^{\lambda}_{n,m}(G)$ is an asymptotically unbiased estimator of $R_K(G)$ which uses kernel deconvolution estimators with smoothing parameter $\lambda$. It is called the deconvolution empirical risk and will be of the form

$$R^{\lambda}_{n,m}(G) = \frac{1}{2n}\sum_{j=1}^{n} h_{K\setminus G,\lambda}(Z_j^{(1)}) + \frac{1}{2m}\sum_{j=1}^{m} h_{G,\lambda}(Z_j^{(2)}), \qquad (1.7)$$

where the $h_{G,\lambda}(\cdot)$ are smoothed versions of the indicator functions used in classical ERM for direct observations (see Section 3 for details).
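The failure of the classical empirical risk under measurement error can be checked on a toy simulation. The sketch below is only an illustration under assumptions that are not in the paper: $K = [0,1]$, $Q$ the Lebesgue measure, the hypothetical densities $f(x) = 2x$ and $g(x) = 2(1-x)$, a Laplace noise with scale `b`, and the set $G = [0.5, 1]$, for which $R_K(G) = 1/4$ by direct integration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 200_000
b = 0.2  # scale of the (assumed) Laplace measurement error

# Toy densities on K = [0, 1]: f(x) = 2x and g(x) = 2(1 - x).
x1 = np.sqrt(rng.uniform(size=n))          # sample with density f
x2 = 1.0 - np.sqrt(rng.uniform(size=m))    # sample with density g
z1 = x1 + rng.laplace(0.0, b, size=n)      # noisy versions Z = X + eps
z2 = x2 + rng.laplace(0.0, b, size=m)

def empirical_risk(s1, s2, lo=0.5, hi=1.0):
    """Empirical risk (1.5) for G = [lo, hi], with K minus G = [0, lo)."""
    in_complement = (s1 >= 0.0) & (s1 < lo)   # first sample falling in K minus G
    in_g = (s2 >= lo) & (s2 <= hi)            # second sample falling in G
    return 0.5 * in_complement.mean() + 0.5 * in_g.mean()

true_risk = 0.25  # R_K(G) for G = [0.5, 1], computed by hand from (1.1)
risk_direct = empirical_risk(x1, x2)   # uses the unobservable X's
risk_noisy = empirical_risk(z1, z2)    # uses the observed Z's
print(f"R_K(G) = {true_risk:.3f}, direct: {risk_direct:.3f}, noisy: {risk_noisy:.3f}")
```

With direct observations the empirical risk settles near $R_K(G)$, whereas with the noisy sample it settles near the risk computed from the convolved densities, which is visibly different here.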

In this paper, we would like to describe as precisely as possible the influence of the error $\epsilon$ on the classification rates of convergence and the presence of fast rates. Our aim is to use the asymptotic theory of empirical processes in the spirit of [16] (see also [32]) when dealing with the deconvolution empirical risk (1.7). To this end, we give the explicit form of the functions $h_{G,\lambda}$ in this framework. In particular, we need to study in detail the complexity of the class of functions $\{h_{G,\lambda}, G \in \mathcal{G}\}$ in order to get statistical performances of the ERM estimator. This complexity is related to the complexity imposed on $\mathcal{G}$, such as boundary fragment assumptions or regularity hypotheses on the function $f - g$. For each assumption, we establish lower and upper bounds and discuss the performances of this deconvolution ERM estimator for this problem. Such a study allows a first comparison of the robustness of these complexity assumptions w.r.t. the presence of errors in variables. Note that the results presented here focus on the discriminant analysis setup but could be generalized to the classification framework in future work. Moreover, the problem of adaptation will not be considered in this paper but could be the core of a more advanced contribution.

We point out that the definition of the empirical risk (1.7) leads to a new and interesting theory of risk bounds detailed in Section 3 for discriminant analysis. In particular, the parameter $\lambda$ has to be calibrated to reach a bias/variance trade-off in the decomposition of the excess risk. Related ideas have been recently proposed in [TH] in the Gaussian white noise model and density estimation setting for more general linear inverse problems using singular value decomposition. In our framework, to our knowledge, the only minimax result is [T7] which gives minimax rates in Hausdorff distance for manifold estimation in the presence of noisy variables. [11] also gives consistency and limiting distributions for estimators of boundaries in deconvolution problems, but no minimax results are proposed. In the direct case, we can also apply this methodology and consider an empirical risk given by the estimation of $f$ and $g$ using simple kernel density estimators. This idea has already been mentioned in [33] in the general learning context and called Vicinal Risk Minimization (see also [9]). However, even in pattern recognition and in the direct case, to our knowledge, there are no asymptotic rates of convergence for this empirical minimization principle.

The paper is organized as follows. In Section 2, the two main complexity assumptions used in this paper are made explicit and associated lower bounds are proposed. These lower bounds generalize the previous lower bounds of [23] and [2]. Deconvolution ERM estimators attaining these rates are presented in Section 3. A brief discussion and some perspectives are gathered in Section 4 while Section 5 is dedicated to the proofs of the main results.

2 Plug-in vs boundary fragments



In this section, we detail some common assumptions (complexity and margin) that can be set on the pair $(f,g)$. We then propose lower bounds on the corresponding minimax rates.

First of all, given a set $G \subset K$, simple algebra indicates that the excess risk $R_K(G) - R_K(G^*_K)$ can be written as:

$$R_K(G) - R_K(G^*_K) = \frac{1}{2} d_{f,g}(G, G^*_K),$$

where the pseudo-distance $d_{f,g}$ over subsets of $K \subset \mathbb{R}^d$ is defined as

$$d_{f,g}(G_1, G_2) = \int_{G_1 \Delta G_2} |f - g|\,dQ,$$

and $G_1 \Delta G_2 = [G_1^c \cap G_2] \cup [G_2^c \cap G_1]$ is the symmetric difference between two sets $G_1$ and $G_2$. In this context, there is another natural way of measuring the accuracy of a decision rule $G$ through the quantity:

$$d_\Delta(G, G^*_K) = \int_{G \Delta G^*_K} dQ,$$

where $d_\Delta$ also defines a pseudo-distance on the subsets of $K \subset \mathbb{R}^d$.
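The identity between the excess risk and $\frac{1}{2} d_{f,g}$ can be verified numerically. The following sketch assumes a one-dimensional toy setting that is not taken from the paper ($K = [0,1]$, $Q$ Lebesgue, $f(x) = 2x$, $g(x) = 2(1-x)$) and encodes sets as boolean masks on a midpoint grid; the function names are hypothetical.

```python
import numpy as np

N = 100_000
x = (np.arange(N) + 0.5) / N          # midpoint grid on K = [0, 1]
dx = 1.0 / N
f = 2.0 * x                            # toy density f
g = 2.0 * (1.0 - x)                    # toy density g

def risk(G):
    """Bayes risk (1.1) of the set encoded by the boolean mask G."""
    return 0.5 * (np.sum(f[~G]) + np.sum(g[G])) * dx

def d_fg(G1, G2):
    """Pseudo-distance d_{f,g}: integral of |f - g| over the symmetric difference."""
    return np.sum(np.abs(f - g)[G1 ^ G2]) * dx

def d_delta(G1, G2):
    """Pseudo-distance d_Delta: Q-measure of the symmetric difference."""
    return np.sum(G1 ^ G2) * dx

G_star = f >= g                        # Bayes set, here [0.5, 1]
G = x >= 0.3                           # a competing decision set [0.3, 1]

excess = risk(G) - risk(G_star)
print(excess, 0.5 * d_fg(G, G_star), d_delta(G, G_star))
```

In this example the two sets differ on $[0.3, 0.5)$, so $d_\Delta$ is about $0.2$ while the excess risk and $\frac{1}{2} d_{f,g}$ coincide at about $0.04$: the same set error costs less in risk because $|f - g|$ is small near the boundary of $G^*_K$.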

In this paper, we are interested in the minimax rates associated with these pseudo-distances. In other words, given a class $\mathcal{F}$, one would like to quantify as precisely as possible the corresponding minimax risks defined as

$$\inf_{\hat G_{n,m}} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d_\square(\hat G_{n,m}, G^*_K),$$

where the infimum is taken over all possible estimators of $G^*_K$ and $d_\square$ stands for $d_{f,g}$ or $d_\Delta$ depending on the context. In particular, we will exhibit classification rules $\hat G_{n,m}$ attaining these rates. In order to obtain a satisfying study of the minimax rates mentioned above, one needs to detail the considered classes $\mathcal{F}$. Such a class expresses some conditions that can be set on the pair $(f, g)$. They are often separated into two categories: margin and complexity assumptions.

A first condition that can be set on the pair $(f, g)$ is the well-known margin assumption. It has been introduced in discriminant analysis (see [23]) as follows:

Margin Assumption: There exist positive constants $t_0, c_2$ and $\alpha \geq 0$ such that for $0 < t < t_0$:

$$Q\{x \in K : |f(x) - g(x)| \leq t\} \leq c_2 t^{\alpha}. \qquad (2.1)$$

This assumption is related to the behaviour of $|f - g|$ at the boundary of $G^*_K$. It may give a variety of minimax fast rates of convergence which depend on the margin parameter $\alpha$. A large margin corresponds to configurations where the slope of $|f - g|$ is high at the boundary of $G^*_K$. The most favorable case corresponds to a margin $\alpha = +\infty$, when $f - g$ jumps at the boundary of $G^*_K$.
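To see how the margin parameter reflects the behaviour of $|f - g|$ near the boundary, one can estimate $\alpha$ as a log-log slope. The sketch below is a toy check under assumptions of ours: $K = [0,1]$, $Q$ Lebesgue, and a hypothetical difference $f - g$ behaving like $|x - 1/2|^{\kappa}$ around the boundary point $1/2$, for which (2.1) holds with $\alpha = 1/\kappa$.

```python
import numpy as np

N = 200_000
x = (np.arange(N) + 0.5) / N               # midpoint grid on K = [0, 1], Q = Lebesgue
kappa = 2.0
# Toy stand-in for f - g: odd around 1/2, so it integrates to zero over K.
diff = np.sign(x - 0.5) * np.abs(x - 0.5) ** kappa

def margin_mass(t):
    """Q{x in K : |f(x) - g(x)| <= t}, approximated on the grid."""
    return np.mean(np.abs(diff) <= t)

ts = np.logspace(-3, -1, 20)
mass = np.array([margin_mass(t) for t in ts])
alpha_hat = np.polyfit(np.log(ts), np.log(mass), 1)[0]   # log-log slope
print(f"estimated margin parameter: {alpha_hat:.3f} (theory: {1 / kappa})")
```

A flatter crossing of $f$ and $g$ (larger `kappa`) puts more $Q$-mass where $|f - g|$ is small and therefore yields a smaller margin parameter.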

From a practical point of view, this assumption provides a precise description of the interaction between the pseudo-distances $d_{f,g}$ and $d_\Delta$. In particular, it allows a control of the variance of the empirical processes involved in the upper bounds. Note that in the presence of noise in variables, Lemma [5] in the Appendix proposes a useful generalization of Lemma 2 in [24]. More general assumptions of this type can be formulated (see for instance [6] or [19]) in a more general statistical learning context.






The margin assumption is 'structural' in the sense that it describes the difficulty of distinguishing an observation having density $f$ from another with density $g$. In order to provide a complete study, one also needs to set an assumption on the difficulty of finding $G^*_K$ in a possible set of candidates, namely a complexity assumption. In the classification framework, two different kinds of complexity assumptions are often proposed in the literature. The first kind concerns the regularity of the boundary of the Bayes classifier. Indeed, our aim is to estimate $G^*_K$, which corresponds to a nonparametric set estimation problem. In this context, it seems natural to translate the difficulty of the learning process into a condition on the shape of $G^*_K$. Another way to describe the complexity of the problem is to impose a condition on the regularity of the underlying densities $f$ and $g$. This kind of condition is originally related to plug-in approaches.

Note that no clear connection can be established between these two kinds of assumptions: a set $G^*_K$ with a smooth boundary is not necessarily associated with smooth densities. In the two following subsections, we provide a precise description of the assumptions that we will use in this paper. In each case, we propose lower bounds for the associated minimax rates of convergence in this noisy setting. Corresponding upper bounds are presented and discussed in Section 3.

For the sake of convenience, we will also require in the following an additional assumption on the noise $\epsilon$. We assume in the sequel that $\epsilon = (\epsilon_1, \dots, \epsilon_d)'$ admits a density $\eta$ with respect to the Lebesgue measure satisfying

$$\eta(x) = \prod_{i=1}^{d} \eta_i(x_i) \quad \forall x \in \mathbb{R}^d. \qquad (2.2)$$

In other words, the entries of the vector $\epsilon$ are independent. The assumption below describes the difficulty of the considered problems. It is often called the ordinary smooth case in the inverse problem literature.

Noise Assumption: There exist $(\beta_1, \dots, \beta_d)' \in \mathbb{R}^d_+$ such that for all $i \in \{1, \dots, d\}$, $\beta_i > 1/2$,

$$|\mathcal{F}[\eta_i](t)| \sim |t|^{-\beta_i} \quad \text{and} \quad |\mathcal{F}'[\eta_i](t)| \sim |t|^{-\beta_i} \quad \text{as } t \to +\infty,$$

where $\mathcal{F}[\eta_i]$ denotes the Fourier transform of $\eta_i$. Moreover, we assume that $\mathcal{F}[\eta_i](t) \neq 0$ for all $t \in \mathbb{R}$ and $i \in \{1, \dots, d\}$.

Classical results in deconvolution (see e.g. [H], [IS] or [8] among others) are stated for $d = 1$. Two different settings are then distinguished concerning the difficulty of the problem, which is expressed through the shape of $\mathcal{F}[\eta]$. One considers alternatively the case where $|\mathcal{F}[\eta](t)| \sim |t|^{-\beta}$ as $t \to +\infty$, which corresponds to a mildly ill-posed inverse problem, or $|\mathcal{F}[\eta](t)| \sim e^{-\gamma t}$ as $t \to +\infty$, which leads to a severely ill-posed inverse problem. This last setting corresponds to a particularly difficult problem and is often associated with slow minimax rates of convergence.

In this paper, we only deal with $d$-dimensional mildly ill-posed deconvolution problems. For the sake of brevity, we do not consider severely ill-posed inverse problems or possible intermediates (e.g. a combination of polynomial and exponential decreasing densities). Nevertheless, the rates in these cases could be obtained through the same steps.

2.1 The boundary fragment assumption

We focus in this subsection on an assumption related to the regularity of the boundary of $G^*_K$. More precisely, we deal with the family of boundary fragments on $K = [0,1]^d$. A set $G \subset [0,1]^d$ belongs to a class of boundary fragments (see [20]) if there exists $b : [0,1]^{d-1} \to [0,1]$ such that:

$$G = \{x = (x_1, \dots, x_d) : x_d \leq b(x_1, \dots, x_{d-1})\} := G_b.$$

For given $\gamma, L > 0$, the class of Hölder boundary fragments is then defined as

$$\mathcal{G}(\gamma, L) = \{G_b,\ b \in \Sigma(\gamma, L)\},$$

where $\Sigma(\gamma, L)$ is the class of isotropic Hölder continuous functions $b(x_1, \dots, x_{d-1})$ having continuous partial derivatives up to order $\lfloor\gamma\rfloor$, the maximal integer strictly less than $\gamma$, and such that:

$$|b(y) - p_{b,x}(y)| \leq L|x - y|^{\gamma} \quad \forall x, y \in [0,1]^{d-1},$$

where $p_{b,x}$ is the Taylor polynomial of $b$ at order $\lfloor\gamma\rfloor$ at point $x$.

Boundary fragment assumption. There exist positive constants $\gamma_f$ and $L$ such that the set $G^*_K$ belongs to $\mathcal{G}(\gamma_f, L)$.

In the following, we denote by $\mathcal{F}_{frag}$ the set of all pairs $(f, g)$ satisfying both the margin and boundary fragment assumptions. Theorem 1 states lower bounds for the minimax risks over the class $\mathcal{F}_{frag}$. The proof is postponed to Section 5.

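Theorem 1 Let $K = [0,1]^d$ and $\mathcal{F} = \mathcal{F}_{frag}$. Suppose that $Q$ is the Lebesgue measure on $K$ and that the noise assumption is satisfied. Then

$$\liminf_{n,m \to +\infty}\ \inf_{\hat G_{n,m}}\ \sup_{(f,g) \in \mathcal{F}_{frag}} (n \wedge m)^{\tau_d(\alpha,\beta,\gamma_f)}\, \mathbb{E}\, d_\square(\hat G_{n,m}, G^*_K) > 0,$$

where the infimum is taken over all possible estimators of the set $G^*_K$ and

$$\tau_d(\alpha,\beta,\gamma_f) = \begin{cases} \dfrac{\gamma_f\,\alpha}{\gamma_f(2+\alpha) + (d-1)\alpha + 2\alpha\sum_{i=1}^{d-1}\beta_i + 2\alpha\beta_d\gamma_f} & \text{for } d_\square = d_\Delta, \\[3ex] \dfrac{\gamma_f(\alpha+1)}{\gamma_f(2+\alpha) + (d-1)\alpha + 2\alpha\sum_{i=1}^{d-1}\beta_i + 2\alpha\beta_d\gamma_f} & \text{for } d_\square = d_{f,g}. \end{cases}$$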

Note that we obtain exactly the same lower bounds as [23] in the direct case, which corresponds to the situation where $\beta_j = 0$ for all $j \in \{1, \dots, d\}$. In this particular framework, the minimax rate of convergence mainly depends on $\gamma_f$ and $\alpha$. The coefficient $\gamma_f$ corresponds to the regularity of the boundary of $G^*_K$: the larger $\gamma_f$, the easier the estimation. The term $\alpha$ is related to the margin assumption. The case $\alpha = +\infty$ actually corresponds to a jump of the function $f - g$ near the boundary of $G^*_K$. Conversely, a small $\alpha$ is associated with a very difficult problem since the difference between $f$ and $g$ may be quite small in such a situation.

In the presence of noise in the variables, the rates obtained in Theorem 1 are slower. The price to pay is an additional term of the form

$$2\alpha \sum_{i=1}^{d-1} \beta_i + 2\alpha\beta_d\gamma_f.$$

This term clearly connects the difficulty of the problem to the values of the coefficients $\beta_1, \dots, \beta_d$. Moreover, the above expression highlights a connection between the margin parameter and the ill-posedness. The role of the margin parameter in the inverse problem can be summarized as follows: the larger the margin, the higher the price to pay for a given degree of ill-posedness. When the margin parameter is small, the problem is difficult at the boundary of $G^*_K$ and we can only expect a non-sharp estimation of $G^*_K$. In this case it is not significantly worse to add noise. On the contrary, for a large margin parameter, there is good hope of obtaining a sharp estimation of $G^*_K$, and perturbing the input variables then has strong consequences on the performances.

Note also in the above expression that the first $d - 1$ components of $\epsilon$ do not have the same impact as the last (vertical) component. This is due to the fact that we consider boundary fragments with a given regularity $\gamma_f$.
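The interplay between the margin and the ill-posedness can be made concrete by evaluating the exponent of Theorem 1. The helper below is a hypothetical sketch (the name `tau_frag` and the chosen parameter values are ours); it implements the exponent as we read it from the statement of Theorem 1 and compares the direct case ($\beta_i = 0$) with a noisy case.

```python
def tau_frag(alpha, betas, gamma, distance="delta"):
    """Exponent tau_d(alpha, beta, gamma_f) as read from Theorem 1.

    betas = (beta_1, ..., beta_d); distance is "delta" for d_Delta
    or "fg" for d_{f,g}.
    """
    d = len(betas)
    # Additional term due to the noise: 2*alpha*sum_{i<d} beta_i + 2*alpha*beta_d*gamma.
    noise = 2 * alpha * sum(betas[:-1]) + 2 * alpha * betas[-1] * gamma
    denom = gamma * (2 + alpha) + (d - 1) * alpha + noise
    numer = gamma * alpha if distance == "delta" else gamma * (alpha + 1)
    return numer / denom

d, gamma = 2, 2.0   # illustrative values only
for alpha in (0.5, 1.0, 4.0):
    direct = tau_frag(alpha, (0.0,) * d, gamma, "fg")
    noisy = tau_frag(alpha, (1.0,) * d, gamma, "fg")
    print(f"alpha={alpha}: direct {direct:.3f}, noisy {noisy:.3f}, ratio {noisy / direct:.3f}")
```

For these values the ratio between the noisy and the direct exponent decreases as $\alpha$ grows, which is the numerical counterpart of the statement that a larger margin makes the noise more costly.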

2.2 The plug-in assumption 

The boundary fragment assumption concerns the set $G^*_K$ and in particular the smoothness of its boundary. Other conditions have been proposed in the literature in order to explain and quantify the difficulty related to a classification problem.

An alternative hypothesis concerns the regularity of the function $f - g$ itself. In the following, we denote by $\Sigma'(\gamma, L)$ the class of $d$-dimensional isotropic Hölder continuous functions.

Plug-in Assumption. There exist positive constants $\gamma_p$ and $L'$ such that $f - g \in \Sigma'(\gamma_p, L')$.

We then call $\mathcal{F}_{plug}$ the set of all pairs $(f, g)$ satisfying both the margin and plug-in assumptions, since the previous assumption is often associated with plug-in rules in the statistical learning literature. The following theorem proposes a lower bound for the noisy smooth discriminant analysis problem in such a setting.

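Theorem 2 Let $\mathcal{F} = \mathcal{F}_{plug}$. Suppose that $Q$ is absolutely continuous with respect to the Lebesgue measure and that the noise assumption is satisfied. Then, provided $\alpha \leq 1$,

$$\liminf_{n,m \to +\infty}\ \inf_{\hat G_{n,m}}\ \sup_{(f,g) \in \mathcal{F}_{plug}} (n \wedge m)^{\tau_d(\alpha,\beta,\gamma_p)}\, \mathbb{E}\, d_\square(\hat G_{n,m}, G^*_K) > 0,$$

where the infimum is taken over all possible estimators of the set $G^*_K$ and

$$\tau_d(\alpha,\beta,\gamma_p) = \begin{cases} \dfrac{\gamma_p\,\alpha}{\gamma_p(2+\alpha) + d + 2\sum_{i=1}^{d}\beta_i} & \text{for } d_\square = d_\Delta, \\[3ex] \dfrac{\gamma_p(\alpha+1)}{\gamma_p(2+\alpha) + d + 2\sum_{i=1}^{d}\beta_i} & \text{for } d_\square = d_{f,g}. \end{cases}$$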

As in the previous subsection, we obtain the same lower bound as [2] in the direct case, i.e. when $\beta_i = 0$ for all $i \in \{1, \dots, d\}$. Once again, the larger $\alpha$, the easier the estimation. Moreover, smooth densities will provide a simpler classification problem.

As in Theorem 1, in the presence of noise in the variables, the rates obtained in Theorem 2 are slower. The price to pay is an additional term of the form $2\sum_{i=1}^{d}\beta_i$. Nevertheless, the way in which the parameters $\gamma_p$, $\alpha$ and the $\beta_i$ interact is slightly different from the boundary fragment case. This is not surprising since the structure and the complexity of the problem have changed. Here $\gamma_p$ denotes the regularity of $f - g$ and interacts directly with the margin parameter $\alpha$.

Note that this lower bound is valid only for $\alpha \leq 1$. Since we use in the proof of Theorem 2 an algebra based on standard Fourier analysis tools, we have to consider sufficiently smooth objects. As a consequence, in the lower bounds, we can check the margin assumption only for values of $\alpha \leq 1$. Nevertheless, we conjecture that this restriction is only due to technical reasons and that our result remains pertinent for all $\alpha, \gamma \in \mathbb{R}_+$. In particular, an interesting direction is to consider a wavelet basis which provides an isometric wavelet transform in $L^2$ in order to obtain the desired lower bound in the general case.

3 Upper bounds

3.1 Estimation of $G^*_K$

In the free-noise case ($\epsilon_j^{(i)} = 0$ for all $j \in \{1, \dots, n_i\}$ and $i \in \{1,2\}$), we deal with two samples $(X_1^{(1)}, \dots, X_n^{(1)})$ and $(X_1^{(2)}, \dots, X_m^{(2)})$ having respective densities $f$ and $g$. A standard way to estimate $G^*_K = \{x \in K : f(x) \geq g(x)\}$ is to estimate $R_K(\cdot)$ thanks to the data. For all $G \subset K$, the risk $R_K(G)$ can be estimated by the empirical risk defined in (1.5). Then the Bayes classifier $G^*_K$ is estimated by $\hat G_{n,m}$ defined as a minimizer of the empirical risk (1.5) over a given family of sets $\mathcal{G}$. We know for instance from [23] that the estimator $\hat G_{n,m}$ reaches the minimax rates of convergence of Theorem 1 for $\beta = 0$ when $\mathcal{G} := \mathcal{G}(\gamma, L)$ corresponds to the set of boundary fragments with $\gamma > d - 1$. For larger sets $\mathcal{G}(\gamma, L)$, as proposed in [21], the minimization can be restricted to a $\delta$-net of $\Sigma(\gamma, L)$. With an additional assumption on the approximation power of this $\delta$-net, the same minimax rates can be achieved in a subset of $\mathcal{G}(\gamma, L)$.

If we consider complexity assumptions related to the smoothness of $f - g$, we can show, roughly following [5], that a hybrid plug-in/ERM estimator reaches the minimax rates of convergence of Theorem 2 in the free-noise case. The principle of the method is to consider the empirical minimization (1.5) over a particular class $\mathcal{G}$ based on plug-in type decision sets. More precisely, following [2] for classification, we can minimize in the direct case the empirical risk over a class $\mathcal{G}$ of the form:

$$\mathcal{G} = \{\{f - g \geq 0\},\ f - g \in \mathcal{N}_{n,m}\},$$

where $\mathcal{N}_{n,m}$ is a $\delta$-net over the class of densities, and where $\delta := \delta_n$ is well chosen. With such a procedure, minimax rates can be obtained with no restriction on the parameters $\gamma_p$, $\alpha$ and $d$.



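In noisy discriminant analysis, the ERM estimator (1.6) is not available since we only observe noisy samples. The probability densities of the samples w.r.t. the Lebesgue measure are respectively the convolutions $(f \cdot \mu) * \eta$ and $(g \cdot \mu) * \eta$, and the classical ERM principle then fails since:

$$\frac{1}{2n}\sum_{i=1}^{n} \mathbf{1}_{\{Z_i^{(1)} \in G^c\}} + \frac{1}{2m}\sum_{i=1}^{m} \mathbf{1}_{\{Z_i^{(2)} \in G\}} \xrightarrow[n,m \to \infty]{} \frac{1}{2}\left[\int_{K\setminus G} (f \cdot \mu) * \eta(x)\,dx + \int_{G} (g \cdot \mu) * \eta(x)\,dx\right] \neq R_K(G).$$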



Hence, we have to add a deconvolution step to the classical ERM estimator. In this context, we can construct a deconvolution kernel provided that the noise has a nonvanishing Fourier transform, as expressed in the Noise Assumption. This is rather classical in the inverse problem literature (see
e.g. [2], 0) [IH] or [26]). With such an assumption, we are able to construct a deconvoluting 
kernel as follows. 



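Let $\mathcal{K} = \prod_{j=1}^{d} \mathcal{K}_j : \mathbb{R}^d \to \mathbb{R}$ be a $d$-dimensional function defined as the product of $d$ unidimensional functions $\mathcal{K}_j$. The properties of $\mathcal{K}$ leading to satisfying upper bounds (depending on the considered complexity assumption) will be made precise later on. Then, if we denote by $\lambda = (\lambda_1, \dots, \lambda_d)$ a set of (positive) bandwidths and by $\mathcal{F}[\cdot]$ the Fourier transform, we define $\mathcal{K}_\eta$ as

$$\mathcal{K}_\eta : \mathbb{R}^d \to \mathbb{R}, \quad t \mapsto \mathcal{K}_\eta(t) = \mathcal{F}^{-1}\left[\frac{\mathcal{F}[\mathcal{K}](\cdot)}{\mathcal{F}[\eta](\cdot/\lambda)}\right](t). \qquad (3.1)$$

In this context, for all $G \subset K$, the risk $R_K(G)$ can be estimated by

$$R^{\lambda}_{n,m}(G) = \frac{1}{2n}\sum_{j=1}^{n} h_{K\setminus G,\lambda}(Z_j^{(1)}) + \frac{1}{2m}\sum_{j=1}^{m} h_{G,\lambda}(Z_j^{(2)}),$$

where for a given $z \in \mathbb{R}^d$,

$$h_{G,\lambda}(z) := \int_{G} \frac{1}{\lambda}\,\mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right) dQ(x). \qquad (3.2)$$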
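The effect of the function $h_{G,\lambda}$ in (3.2) can be illustrated in dimension one. The sketch below relies on assumptions that are ours and not the paper's: $Q$ is the Lebesgue measure on $K = [0,1]$, $X$ is uniform on $K$, the noise is Laplace with scale `b` (so that $\mathcal{F}[\eta](t) = 1/(1 + b^2t^2)$), and $\mathcal{K}$ is the standard Gaussian density. Under these choices, (3.1) gives $\mathcal{K}_\eta(u) = \mathcal{K}(u) - (b/\lambda)^2 \mathcal{K}''(u)$, and the integral over $G = [a, c]$ has a closed form.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, b, lam = 100_000, 0.2, 0.1     # sample size, Laplace scale, bandwidth (all hypothetical)
a, c = 0.25, 0.75                 # G = [a, c] inside K = [0, 1]

x = rng.uniform(size=n)                       # unobserved X with density f = 1 on K
z = x + rng.laplace(0.0, b, size=n)           # observed Z = X + eps

Phi = np.vectorize(lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))  # Gaussian cdf
phi = lambda u: np.exp(-0.5 * u ** 2) / math.sqrt(2.0 * math.pi)           # Gaussian pdf

def h(z):
    """h_{G,lambda}(z) from (3.2) with dQ(x) = dx, Gaussian K and Laplace noise.

    Here F[eta](t) = 1 / (1 + b^2 t^2), so K_eta(u) = K(u) - (b / lam)^2 K''(u),
    and integrating over G = [a, c] gives the closed form below.
    """
    ua, uc = (z - a) / lam, (z - c) / lam
    smooth_indicator = Phi(ua) - Phi(uc)                          # integral of K
    correction = (b / lam) ** 2 * (ua * phi(ua) - uc * phi(uc))   # minus integral of K''
    return smooth_indicator + correction

naive = np.mean((z >= a) & (z <= c))   # frequency of noisy points in G
deconv = np.mean(h(z))                 # deconvolution counterpart
print(f"P(X in G) = 0.5, naive: {naive:.3f}, deconvolution: {deconv:.3f}")
```

In this toy setting the plain frequency of noisy points underestimates $P(X \in G) = 1/2$, while the average of $h_{G,\lambda}(Z_j)$ is centred close to it, at the price of a larger variance that grows as $\lambda$ decreases.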



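In the following, we propose to study ERM estimators defined as

$$\hat G^{\lambda}_{n,m} = \arg\min_{G \in \mathcal{G}} R^{\lambda}_{n,m}(G), \qquad (3.3)$$

where the parameter $\lambda = (\lambda_1, \dots, \lambda_d) \in \mathbb{R}^d_+$ has to be chosen explicitly. It is important to note that in (3.2) $h_{G,\lambda}$ depends on $Q$. Hence, the measure $Q$ needs to be known a priori. This differs from the direct case, where the empirical risk is independent of the nature of $Q$. The functions $h_{G,\lambda}$ in equation (3.2) are at the core of the upper bounds. In particular, following the pioneering work of Vapnik (see [33]), we have

$$\begin{aligned} R_K(\hat G^{\lambda}_{n,m}) - R_K(G^*_K) &\leq R_K(\hat G^{\lambda}_{n,m}) - R^{\lambda}_{n,m}(\hat G^{\lambda}_{n,m}) + R^{\lambda}_{n,m}(G^*_K) - R_K(G^*_K) \\ &\leq R^{\lambda}_K(\hat G^{\lambda}_{n,m}) - R^{\lambda}_{n,m}(\hat G^{\lambda}_{n,m}) + R^{\lambda}_{n,m}(G^*_K) - R^{\lambda}_K(G^*_K) \\ &\quad + (R_K - R^{\lambda}_K)(\hat G^{\lambda}_{n,m}) - (R_K - R^{\lambda}_K)(G^*_K) \\ &\leq \sup_{G \in \mathcal{G}} |R^{\lambda}_K - R^{\lambda}_{n,m}|(G - G^*_K) + \sup_{G \in \mathcal{G}} |R^{\lambda}_K - R_K|(G - G^*_K), \qquad (3.4) \end{aligned}$$

where $R^{\lambda}_K(\cdot)$ corresponds to the expectation of $R^{\lambda}_{n,m}(\cdot)$. As a result, to get risk bounds, we have to deal with two opposing terms, namely a so-called variability term

$$\sup_{G \in \mathcal{G}} |R^{\lambda}_K - R^{\lambda}_{n,m}|(G - G^*_K), \qquad (3.5)$$

and a bias term (since $\mathbb{E} R^{\lambda}_{n,m}(G) \neq R_K(G)$) of the form:

$$\sup_{G \in \mathcal{G}} |R^{\lambda}_K - R_K|(G - G^*_K). \qquad (3.6)$$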

The variability term (3.5) gives rise to the study of increments of empirical processes. In this paper this control is based on entropy conditions and uniform concentration inequalities which are inspired by results presented for instance in [32] or [16]. The main novelty here is that in the noisy case, empirical processes are indexed by a class of functions which depends on the smoothing parameter $\lambda$. The bias term (3.6) is controlled by taking advantage of the properties of $\mathcal{G}$ and of the assumptions on the kernel $\mathcal{K}$. Indeed, it can be related to the standard bias term in nonparametric density estimation with more or less technicalities, according to the smoothness assumption (boundary fragments or plug-in type). This bias term is inherent to the proposed estimation procedure and its control is a cornerstone of the upper bounds.

The choice of $\lambda$ will be a trade-off between the two opposing terms (3.5) and (3.6). A small $\lambda > 0$ leads to complex functions $h_{G,\lambda}$ and inflates the variance term, whereas (3.6) vanishes when $\lambda$ tends to zero. The kernel $\mathcal{K}$ has to be chosen in order to take advantage of the different conditions on $G^*_K$. This choice will be made according to the following definition.

Definition We say that $\mathcal{K}$ is a kernel of order $l \in \mathbb{N}^*$ with respect to $Q$ if and only if:

• $\int_K \mathcal{K}(u)\,dQ(u) = 1$.

• $\int_K u_j^k\,\mathcal{K}(u)\,dQ(u) = 0$, $\forall k = 1, \dots, l$, $\forall j = 1, \dots, d$.

• $\int_K |u_j|^{l}\,|\mathcal{K}(u)|\,dQ(u) < \infty$, $\forall j = 1, \dots, d$.

In addition to this definition, we will require that the deconvolution kernel is suited to the noise density $\eta$ through the following assumption. Such an assumption is rather standard and is for instance satisfied if the kernel $\mathcal{K}$ has a compactly supported Fourier transform (see the proof in the Appendix) and even under a polynomially decreasing behavior of $|\mathcal{F}[\mathcal{K}_j](t)|$.

Kernel Assumption. The kernel $\mathcal{K}$ is such that

$$\sup_{t \in \mathbb{R}^d} |\mathcal{F}[\mathcal{K}_\eta](t)| \leq C \prod_{i=1}^{d} \lambda_i^{-\beta_i},$$

for some positive constant $C$.



The two following subsections study the deconvolution ERM estimator (3.3) and give asymptotic rates of convergence for particular choices of $\lambda$. Under the margin assumption, fast and optimal rates are stated depending on the complexity assumption considered: the boundary fragment assumption or the plug-in assumption.



3.2 Upper bound for the plug-in assumption 

We first point out that for the sake of coherence, we will not study plug-in rules in this paper, although a study similar to [2] could be carried out. Since our aim is to establish minimax rates of convergence under two different complexity assumptions, we focus on the same ERM type estimators of the form (3.3).



For all $\delta > 0$, using the notion of entropy (see for instance [32]) for Hölderian functions on compact sets, we can find a $\delta$-network $\mathcal{N}_\delta$ on $\Sigma'(\gamma_p, L)$ such that

• $\log(\mathrm{card}(\mathcal{N}_\delta)) \leq A\,\delta^{-d/\gamma_p}$,

• For all $h_0 \in \Sigma'(\gamma_p, L)$, we can find $h \in \mathcal{N}_\delta$ such that $\|h - h_0\|_\infty \leq \delta$.

In the following, we associate to each $\nu := f - g \in \Sigma'(\gamma_p, L)$ a set $G_\nu = \{x : \nu(x) \geq 0\}$. Under the plug-in assumption, our ERM estimator will then be defined as

$$\hat G_{n,m} = \arg\min_{\nu \in \mathcal{N}_\delta} R^{\lambda}_{n,m}(G_\nu), \qquad (3.7)$$

where $\delta = \delta_n$ has to be chosen carefully. This procedure has been introduced in the direct case by [2] and is referred to as a hybrid plug-in/ERM procedure. The following theorem describes the performances of $\hat G_{n,m}$.



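Theorem 3 Let $\mathcal{F} = \mathcal{F}_{plug}$ and $\hat G_{n,m}$ the set introduced in (3.7) with

$$\lambda_i = n^{-\frac{1}{\gamma_p(2+\alpha) + 2\sum_{j=1}^{d}\beta_j + d}}, \ \forall i \in \{1, \dots, d\}, \quad \text{and} \quad \delta_n = \left(\frac{\prod_{i=1}^{d}\lambda_i^{-\beta_i}}{\sqrt{n}}\right)^{\frac{2}{d/\gamma_p + 2 + \alpha}}.$$

Suppose that the noise assumption is satisfied with $\beta_i > 1/2$, $\forall i = 1, \dots, d$. Consider a kernel $\mathcal{K}_\eta$ defined as in (3.1) where $\mathcal{K} = \prod_{j=1}^{d}\mathcal{K}_j$ is a kernel of order $\lfloor\gamma_p\rfloor$ with respect to $Q$, which satisfies the kernel assumption. Then

$$\limsup_{n,m \to +\infty}\ \sup_{(f,g) \in \mathcal{F}_{plug}} (n \wedge m)^{\tau_d(\alpha,\beta,\gamma_p)}\, \mathbb{E}\, d_\square(\hat G_{n,m}, G^*_K) < +\infty,$$

where

$$\tau_d(\alpha,\beta,\gamma_p) = \begin{cases} \dfrac{\gamma_p\,\alpha}{\gamma_p(2+\alpha) + d + 2\sum_{i=1}^{d}\beta_i} & \text{for } d_\square = d_\Delta, \\[3ex] \dfrac{\gamma_p(\alpha+1)}{\gamma_p(2+\alpha) + d + 2\sum_{i=1}^{d}\beta_i} & \text{for } d_\square = d_{f,g}. \end{cases}$$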

Theorem 3 validates the lower bounds of Theorem 2. Deconvolution ERM estimators are minimax optimal over the class $\mathcal{F}_{plug}$.

These optimal rates are characterized by the tail behavior of the characteristic function of the error distribution $\eta$. We only consider the ordinary smooth case, whereas straightforward modifications lead to slow rates of convergence in the super smooth case.
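The distinction between ordinary smooth and super smooth noise can be read off the decay of the characteristic function. As a hedged illustration (the two noise laws and the function names are our choice, not the paper's), the sketch below uses the closed-form characteristic functions of the Laplace and Gaussian laws and measures the slope of $\log|\mathcal{F}[\eta](t)|$ against $\log t$.

```python
import numpy as np

def laplace_cf(t, b=1.0):
    """Characteristic function of a centred Laplace law with scale b."""
    return 1.0 / (1.0 + (b * t) ** 2)

def gaussian_cf(t, sigma=1.0):
    """Characteristic function of a centred Gaussian law."""
    return np.exp(-0.5 * (sigma * t) ** 2)

def loglog_slope(cf, t_lo, t_hi):
    """Slope of log|cf(t)| against log t between t_lo and t_hi."""
    return (np.log(abs(cf(t_hi))) - np.log(abs(cf(t_lo)))) / (np.log(t_hi) - np.log(t_lo))

laplace_slope = loglog_slope(laplace_cf, 1e2, 1e3)      # polynomial decay
gauss_slope_near = loglog_slope(gaussian_cf, 2.0, 4.0)
gauss_slope_far = loglog_slope(gaussian_cf, 4.0, 8.0)   # keeps getting steeper
print(laplace_slope, gauss_slope_near, gauss_slope_far)
```

The Laplace slope stabilises near $-2$, i.e. polynomial decay with $\beta = 2 > 1/2$ (ordinary smooth), whereas the Gaussian slope keeps decreasing, which is the signature of the super smooth case excluded from the present results.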

Here fast rates are obtained provided that $\alpha\gamma_p > d + 2\sum_{i=1}^{d}\beta_i$. However, it is important to note that large values of both $\alpha$ and $\gamma_p$ correspond to very restrictive situations. In this case the margin parameter is high whereas the behavior of $f - g$ is smooth, which seems to be contradictory (see the related discussion in [2]).
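The fast-rate condition follows from requiring the exponent for $d_{f,g}$ to exceed $1/2$: $\gamma_p(\alpha+1) > \frac{1}{2}\left[\gamma_p(2+\alpha) + d + 2\sum_i \beta_i\right]$ is equivalent to $\alpha\gamma_p > d + 2\sum_i\beta_i$. A small sketch with hypothetical helper names and illustrative parameter values checks this equivalence on a few configurations.

```python
def tau_plug(alpha, betas, gamma, distance="fg"):
    """Exponent tau_d(alpha, beta, gamma_p) as read from Theorems 2 and 3."""
    d = len(betas)
    denom = gamma * (2 + alpha) + d + 2 * sum(betas)
    numer = gamma * alpha if distance == "delta" else gamma * (alpha + 1)
    return numer / denom

def is_fast(alpha, betas, gamma):
    """Fast-rate condition: alpha * gamma > d + 2 * sum(beta_i)."""
    return alpha * gamma > len(betas) + 2 * sum(betas)

betas = (1.0, 1.0)          # d = 2, hypothetical ordinary smooth noise
for alpha, gamma in [(1.0, 3.0), (1.0, 8.0), (2.0, 4.0)]:
    rate = tau_plug(alpha, betas, gamma)
    print(alpha, gamma, round(rate, 3), is_fast(alpha, betas, gamma))
```

In each configuration the exponent is larger than $1/2$ exactly when the condition holds, and the threshold $d + 2\sum_i\beta_i$ shows that stronger noise requires a smoother $f - g$ or a larger margin before fast rates appear.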

3.3 Upper bound for the boundary fragment assumption 

For the sake of concision, we restrict the set of all possible regularities $\gamma_f$ in $\mathcal G(\gamma_f, L)$
to $\gamma_f > d - 1$. This allows us to control the bracketing entropy of $\mathcal G(\gamma_f, L)$ with a parameter
$\rho = (d-1)/\gamma_f < 1$. Hence, the construction of the ERM estimator can be performed directly (at least
from a theoretical point of view) on this set and leads to the estimator:

$$\hat G_n = \arg\min_{G \in \mathcal G(\gamma_f, L)} R^\lambda_n(G). \qquad (3.8)$$

Nevertheless, for practical purposes one may also define the ERM estimator on a network,
without significant change in the following results.

Theorem 4 below describes the performance of $\hat G_n$ under the boundary fragment assumption.
It seems to highlight a difficulty in obtaining minimax results in this setting and does not entirely
validate the lower bounds of Theorem 1.

Theorem 4 Let $\mathcal F = \mathcal F_{frag}$, $G^*_K \in \mathcal G(\gamma, L)$ with $\gamma > d - 1$ and $\hat G_{n,m}$ the set introduced in (3.8).
Suppose the noise assumption is satisfied with $\beta_d > 1/2$. Consider a kernel $\mathcal K_\lambda$ defined as in (3.1),
satisfying the kernel assumption and such that $\mathcal K_{d-1} = \prod_{j=1}^{d-1} \mathcal K_j$ is a kernel of order $\lfloor\gamma\rfloor$ with
respect to the Lebesgue measure. Then, for all $n \in \mathbb N$, we have



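$$\mathbb E\, d_{f,g}(\hat G_n, G^*_K) \lesssim \left(\frac{\prod_{i=1}^d \lambda_i^{-\beta_i}}{\sqrt n}\right)^{\frac{2(\alpha+1)\gamma}{\gamma(\alpha+2)+(d-1)\alpha}} + \sup_{G\in\mathcal G(\gamma,L)} \left| R^\lambda_K - R_K \right|(G).$$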



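In addition, if

$$\lambda_i = n^{-\frac{\alpha}{\gamma(2+\alpha) + 2\alpha\gamma\beta_d + 2\alpha\sum_{j=1}^{d-1}\beta_j + (d-1)\alpha}}, \quad \forall i\in\{1,\dots,d-1\}, \quad \text{and} \quad \lambda_d = \lambda_1^\gamma,$$

then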

$$\mathbb E\, d_\Delta(\hat G_n, G^*_K) \lesssim n^{-\tau_d(\alpha,\beta,\gamma)} + r_n(\alpha,\lambda,G^*_K), \qquad (3.9)$$

where $n^{-\tau_d(\alpha,\beta,\gamma)}$ is the optimal minimax rate of Theorem 2 and $r_n(\alpha,\lambda,G^*_K)$ is an additional
term defined as:

$$r_n(\alpha,\lambda,G^*_K) = \int |f-g|\, \left| \mathcal K_\lambda * 1_{\{\cdot\in G^*_K\}} - \mathcal K_\lambda * 1_{\{\cdot\in \hat G_n\}} \right| - \int (f-g)\left( \mathcal K_\lambda * 1_{\{\cdot\in G^*_K\}} - \mathcal K_\lambda * 1_{\{\cdot\in \hat G_n\}} \right).$$

Theorem 4 underlines a lack of optimality of the ERM estimator $\hat G_n$ under the Hölder boundary
fragment assumption. It can be explained as follows.

The first assertion of Theorem 4 deals with the excess risk of the procedure. In this case,
using the series of inequalities (3.4), it is straightforward to get the first assertion with
a modified version of Lemma 1 in [23] applied to the noisy setting. However, a control of the
bias term is not yet available. To deal with a boundary fragment type assumption, we have to
write the bias term using Fubini as follows:

$$R^\lambda_K(G) - R_K(G) = \int (f-g)(z) \int \mathcal K(u)\left[1_G(z) - 1_G(z-\lambda.u)\right] du\, dz.$$

The presence of $f - g$ in the above integral seems problematic, and an assumption about the
behavior of $f - g$ at the boundary of $G^*_K$ seems to be necessary to reach the minimax results
of Theorem 1.

To avoid the presence of $f - g$ in the bias term, the second assertion of Theorem 4 proposes
to control $d_\Delta(\hat G_n, G^*_K)$. In this case, it is possible to control a bias term as follows:

1 G - h G , x = / K(u)[l G (z) - l G {z - \.u)]dudz < K + A* 
J i=i 

and to approach the minimax rates of Theorem 1 thanks to an optimal choice of $\lambda$. However, in
this case a residual term appears in the upper bound, and the main problem is that no satisfying
bound on this residual term is, to our knowledge, available. The minimax optimality of $\hat G_n$
remains an open problem.



4 Conclusion 

We have provided in this paper minimax rates of convergence in the framework of smooth
discriminant analysis with errors in variables. We consider two different assumptions on the
complexity of the hypothesis space: plug-in type assumptions or boundary fragments. In the
presence of plug-in type assumptions, we have proved that minimax optimality is reached by
Deconvolution ERM. These optimal rates are fast rates (faster than $n^{-1/2}$) when
$\alpha\gamma > d + 2\sum_{i=1}^d \beta_i$ and generalize the result of [2]. As shown in Table 1, the influence of the
noise $\epsilon$ can be compared with standard results in regression and density estimation with errors
in variables of [14], [15] using kernel deconvolution estimators.





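                        | Density estimation                        | Goodness-of-fit testing                    | Classification
Direct case (ε = 0)     | $n^{-\frac{2\gamma}{2\gamma+1}}$          | $n^{-\frac{2\gamma}{4\gamma+1}}$           | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$
Errors in variables     | $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$   | $n^{-\frac{2\gamma}{4\gamma+4\beta+1}}$    | $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+2\sum_{i=1}^d\beta_i+d}}$
Regularity assumptions  | $f\in\Sigma(\gamma,L)$                    | $f\in W(s,L)$                              | $f-g\in\Sigma(\gamma,L)$
                        | $|\mathcal F[\eta](t)|\sim|t|^{-\beta}$   | $|\mathcal F[\eta](t)|\sim|t|^{-\beta}$    | $|\mathcal F[\eta_i](t)|\sim|t|^{-\beta_i}\ \forall i$

Table 1. Optimal rates of convergence in pointwise $L^2$-risk in density estimation (see [14]),
optimal separation rates for goodness-of-fit testing on Sobolev spaces $W(s,L)$ (see e.g. [8]) and
the result of the paper in smooth discriminant analysis.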



In the presence of boundary fragment assumptions, we state a lower bound which generalizes
the lower bound of the direct case of [23]. However, Deconvolution ERM does not reach this
lower bound. As a result, an open problem is to find the minimax optimal rate of convergence
in the presence of noise under boundary fragment assumptions. A possible way is to find a
classifier reaching the lower bound of Theorem 1. An interesting direction for this purpose
could be to consider convex loss functions in the spirit of [5]. If we take a look at Theorem 4,
standard ERM with hard loss suffers from a lack of regularity. Considering for instance an SVM-type
loss, it could be possible to control the bias term in the Deconvolution ERM using the
Hölder regularity of the boundary. The robustness of SVM with respect to noise in variables
could be an interesting topic for future work. However, this paper seems to highlight that, at first
glance, plug-in type assumptions are better suited to the presence of noise in classification.

We conclude this discussion with some words on adaptation. It is important to note that, considering
the estimation procedure of this paper, we are faced with two different problems of model
selection or adaptation. First of all, the bandwidths proposed in this paper clearly depend on
parameters which may be unknown a priori (e.g. the margin $\alpha$ or the regularity of the boundary
$\gamma$). In this sense, adaptation algorithms should be investigated to choose $\lambda$ automatically
so as to balance the bias term and the variance term. The second step of adaptation would be to
consider a family of nested models $(\mathcal G_k) \subset \mathcal G$ and to choose the model which balances the approximation
term and the estimation term. This could be done using for instance penalization techniques,
such as [21] or [10].

This work can be considered as a first attempt at the study of risk bounds in classification
with errors in variables. It can be extended in many directions. Naturally, the first extension
will be to state the same kind of result in classification. Another natural direction would be to
consider more general complexity assumptions for the hypothesis space $\mathcal G$. In the noise-free case,
[3] proposes to deal with local Rademacher complexities. This allows one to consider many hypothesis
spaces, such as VC classes of sets, kernel classes (see [27]) or even Besov spaces (see [23]). Another
advantage of considering Rademacher complexities is the possibility to develop data-dependent
complexities to deal with the problem of model selection (see [19], [3]) and with the problem of
non-unique solutions of the empirical minimization.

In the direction of statistical inverse problems, there are also many possible lines of study. A
natural direction for applications would be to consider an unknown density $\eta$ for the random noise
$\epsilon$. Dealing with an unknown operator of inversion is a well-known issue in the inverse problem
literature. Another natural extension will be to consider a general linear compact operator
$A : f \mapsto Af$ to generalize the case of deconvolution. In this case, ERM estimators based on
standard regularization methods from the inverse problem literature (see [13]) appear as good
candidates. This could be the material of future works.

In this paper, a classifier $G : \mathcal X \to \mathcal Y$ is always identified with a subset of $\mathbb R^d$. Our aim is
then to estimate the set $G^*_K$ from the noisy observations (1.3). In particular, the main goal is
not only to provide a good classifier but also to understand the relationship between the spatial
position of an input $X \in \mathbb R^d$ and its affiliation to one of the candidate densities. One could
alternatively try to provide the best classifier for a noisy input $Z$ from a noisy training set. These
two problems are certainly comparable, although a rigorous comparison of the two frameworks
and of the respective classification errors remains to be done. We mention for instance [21] or [22]
for a related discussion in a goodness-of-fit context. This could be the core of a future work,
but it requires the preliminary study provided in this paper.



5 Proofs 

In this section, with a slight abuse of notation, $C, c, c' > 0$ denote generic constants that may
vary from line to line, and even within the same line. The notation $a \simeq b$ (resp. $a \lesssim b$) means that
there exist generic constants $C, c > 0$ such that $ca \le b \le Ca$ (resp. $a \le Cb$).

5.1 Proof of Theorem 1



The proof starts as in [23] but then uses some arguments which are specific to the inverse problem
literature (see for instance [8] or [26]).

Let $\mathcal F_1$ be a finite class of densities and $g_0$ a fixed density such that $(f, g_0) \in \mathcal F_{frag}$ for all $f \in \mathcal F_1$.
The composition of $\mathcal F_1$ and the value of $g_0$ will be made precise later on. Then, for every estimator
$\hat G_{n,m}$ of the set $G^*_K$, we have



sup %, s dA (G n7 m,G* K ) > sup Ef tg dA(.G ntm ,G* K ) 



> E 



.(5.1) 



5.1.1 Construction of $\mathcal F_1$

Concerning the density $g_0$, we deal with the uniform density on $[0,1]^2$, i.e.

$$g_0(x) = 1_{\{x\in[0,1]^2\}}, \quad \forall x \in \mathbb R^2.$$

Now, we have to define the class $\mathcal F_1$. First, we consider an infinitely differentiable function $\varphi$
defined on $\mathbb R$ such that $\mathrm{supp}(\varphi) = [-1,1]$, $\varphi(t) \ge 0$ for all $t\in\mathbb R$ and $\|\varphi\|_\infty = \varphi(0) = 1$. Let
$M \ge 2$ be an integer which will be allowed to depend on $n$, and $\tau > 0$ a positive constant. Then,
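for all $j \in \{1,\dots,M\}$, we set

$$\phi_j(t) = \tau M^{-\gamma}\varphi\left(M\left[t - \frac{2j-1}{M}\right]\right), \quad \forall t \in \mathbb R.$$

For all $\omega \in \{0,1\}^M$ and all $t \in \mathbb R$, we define

$$b(t,\omega) = \frac12 + \sum_{j=1}^M \omega_j \phi_j(t).$$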

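In the specific case where $\omega_j = 1$ for all $j \in \{1,\dots,M\}$, we write $b(t,1)$. Then, let $b_0$ and
$C^*$ be positive constants which will be made precise later on. We define the function $f_0 : \mathbb R^2 \to \mathbb R$ as
$f_0(x) = 0$ for all $x \notin [0,1]^2$ and

$$f_0(x) = \begin{cases} 1 + 2\eta_0, & \forall x_2 \in [0,1/2], \\[1ex] 1 + \left(\dfrac{b(x_1,1)-x_2}{c_2}\right)^{1/\alpha} - C^* M^{-\gamma/\alpha}, & \forall x_2 \in [1/2, b(x_1,1)], \\[1ex] 1 - \eta_0 - b_0, & \forall x_2 \in [b(x_1,1), 1], \end{cases}$$

where $C^* = \frac32 (\tau/c_2)^{1/\alpha}$ and $b_0 > 0$ is such that $\int f_0(x)dx = 3/4$. The condition on $C^*$ ensures
that $f_0(x) \le 1$ for all $x_2 \in [1/2, b(x_1,1)]$. We will also use the function $f_1$ defined as

$$f_1(x) = \begin{cases} 0, & \forall x \in [0,1]^2, \\[1ex] \dfrac{C_1}{(1+x_1^2)(1+x_2^2)}, & \forall x \notin [0,1]^2, \end{cases}$$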

where $C_1$ is such that $\int f_1(x)dx = 1/4$. Finally, the set $\mathcal F_1$ is defined as

$$\mathcal F_1 = \{f_\omega,\ \omega \in \{0,1\}^M\},$$

where for a given $\omega \in \{0,1\}^M$,

$$f_\omega(x) = f_0(x) + f_1(x) + \sum_{j=1}^M \omega_j \rho_j(x), \qquad (5.2)$$

for some functions $(\rho_j)_{j=1\dots M}$ which are made explicit below. In order to complete the construction
of the set $\mathcal F_1$, we have to provide a precise definition of the $\rho_j$ and to prove that the $f_\omega$ define
probability density functions for all $\omega \in \{0,1\}^M$.



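We first start with the construction of the $\rho_j$. Let $\rho : \mathbb R \to [0,1]$ be the function
defined as

$$\rho(x) = \frac{1-\cos(x)}{\pi x^2}, \quad \forall x \in \mathbb R,$$

with associated Fourier transform $\mathcal F[\rho](t) = (1-|t|)_+$. In particular, $\mathrm{supp}\,\mathcal F[\rho] = [-1,1]$. For all
$j \in \{1,\dots,M\}$ and $x_2 \in \mathbb R$, introduce

$$\rho_{(2)}(x_2) = \cos\left(\frac{x_2 - \frac12(1+\tau M^{-\gamma})}{\frac32\pi^{-1}\tau M^{-\gamma}}\right)\rho\left(\frac{x_2 - \frac12(1+\tau M^{-\gamma})}{3\pi^{-1}\tau M^{-\gamma}}\right). \qquad (5.3)$$

In the same way, for all $j \in \{1,\dots,M\}$, we define

$$\rho_{j,(1)}(x_1) = \cos\left(\frac{\pi}{3}\,\frac{x_1 - j/M}{M^{-1}}\right)\rho\left(\frac{\pi}{6}\,\frac{x_1 - j/M}{M^{-1}}\right). \qquad (5.4)$$

Then, for all $j \in \{1,\dots,M\}$ and $x = (x_1,x_2) \in [0,1]^2$, we set

$$\rho_j(x) = c^*(\tau M^{-\gamma})^{1/\alpha}\,\rho_{(2)}(x_2)\,\rho_{j,(1)}(x_1), \qquad (5.5)$$

for some constant $c^*$ made explicit below.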


Now, we prove that the $f_\omega$ introduced in (5.2) define density functions. First, remark that

$$\left|\sum_{j=1}^M \omega_j\rho_j(x)\right| \le \begin{cases} CM^{-\gamma/\alpha}(1+x_1^2)^{-1}(1+x_2^2)^{-1}, & \forall x \notin [0,1]^2, \\ CM^{-\gamma/\alpha}, & \forall x \in [0,1]^2. \end{cases}$$

This ensures that $f_\omega \ge 0$ for all $\omega \in \{0,1\}^M$, at least for $M$ large enough. Then recall that
both $f_0$ and $f_1$ are designed in order to guarantee that $\int (f_0+f_1)(x)dx = 1$. Hence, we only
have to show that $\int \rho_j(x)dx = 0$ for all $j \in \{1,\dots,M\}$. In fact, it is only necessary to prove
that $\int \rho_{(2)}(x_2)dx_2 = 0$. First remark that $\int \rho_{(2)}(x_2)dx_2 = \int \tilde\rho_{(2)}(x_2)dx_2$, where $\tilde\rho_{(2)}(x_2) =
\rho_{(2)}(x_2 + \frac12(1+\tau M^{-\gamma}))$ for all $x_2 \in \mathbb R$. Then, using simple algebra



= -Hp 



Stt^tM-i 



} (±— t \ 1 = -tt-VM^W (±2) = 0, 

J V 3/2tt- 1 tA/-t J 2 lpj v ; : 



since the support of the Fourier transform of $\rho$ is $[-1;1]$. Hence, for all $\omega \in \{0,1\}^M$, $f_\omega$ is a
density function.

In order to conclude the proof, we have to show that

$$(f_\omega, g_0) \in \mathcal F_{frag}, \quad \forall \omega \in \{0,1\}^M, \qquad (5.6)$$

which allows us to use the bound (5.1), that

$$Q\{x \in K : |f_\omega(x) - g_0(x)| \le \eta\} \le c_2\eta^\alpha, \quad \forall \omega \in \{0,1\}^M \text{ and } \forall \eta \le \eta_0, \qquad (5.7)$$

which means that the margin assumption is satisfied for our test functions, and that

$$\mathbb E_{g_0}\mathbb E_f\left\{d_\Delta(\hat G_{n,m}, G^*_K) \,\middle|\, X^{(2)}_1,\dots,X^{(2)}_m\right\} \ge C n^{-\frac{\gamma}{\gamma(2/\alpha+1)+2\beta_1+2\beta_2\gamma+1}}, \qquad (5.8)$$

for some positive constant $C$.

5.1.2 Main assumptions check

We first start with the proof of (5.6). First remark that for all $j \in \{1,\dots,M\}$, the function
$\rho_j(\cdot)$ is bounded from above by $CM^{-\gamma/\alpha}$ for some $C > 0$. Then, using simple algebra

S2€[l/2;6(xi,l)] -<x 2 <-+rM-r, 

tM-~< 1 tM-~< tM-~> 

=> < X 2 < , 

2 - 2 2 - 2 ' 

7T X2 — 1/2(1 + tM _1 ) 7T 
=>• < — 7 , < -, 

6 ~ 3tt~ t M~"t ~ 6' 



P(2)(^2) > 



47T 3 ' 



The same kind of lower bound holds for the function $\rho_{j,(1)}$. Hence the $\rho_j$ are uniformly bounded
from below on $[1/2; b(x_1,1)]$. For all $\omega \in \{0,1\}^M$ and for all $x \in [0,1]^2$, we have

Ux) > 1 + f ^l2^l Y /a > 5o(x ) 5 V x 2 G [1/2, 6(^,0;)], 

for $c^*$ large enough. This ensures that

$$\{x \in [0,1]^2 : f_\omega(x) \ge g_0(x)\} = \{x \in [0,1]^2 : 0 \le x_2 \le b(x_1,\omega)\}.$$

In order to conclude the proof of (5.6), we only have to remark that the function $b(\cdot,\omega)$ belongs
to $\Sigma(\gamma,L)$ for all $\omega \in \{0,1\}^M$, at least for $M$ small enough.

Now, we consider the margin assumption (5.7). First, we consider the case where $\eta \le
[\tau c_2^{-1}]^{1/\alpha} M^{-\gamma/\alpha} \le \eta_0$. Clearly, following our choices of $b_0$ and $C^*$, we have that

\fuj(x) - g (x)\ < i] => x 2 G [l/2;&(zi,w)] =^ x 2 < 

Moreover, for all x G [0, l] 2 such that x 2 < b(xi,cj), we have 

l/a A/ 



(fu-9o)(x) 



fb(x,l)-x 2 \ ' sr- r \ r<*n*—r/a 
y ) +2^ ^jPi^) ~ C M 1 > 



where 



Thus 



M 

J^/j^s) - C*M- 7 / a > 0, Vx 2 G 



2 ,&(xi,w) 



" 0o(s)| < r? =► ( ^'^ ^ y 7 " < r? =► x 2 > 6(x 1; a;) - c 2 rf , 

which proves the margin assumption when $\eta \le [\tau c_2^{-1}]^{1/\alpha} M^{-\gamma/\alpha}$. Now, in the case where $\eta_0 \ge
\eta \ge [\tau c_2^{-1}]^{1/\alpha} M^{-\gamma/\alpha}$, we have

$$|f_\omega(x) - g_0(x)| \le \eta \ \Rightarrow\ 1/2 \le x_2 \le b(x_1,1),$$

which entails

$$Q\{x \in K : |f_\omega(x) - g_0(x)| \le \eta\} \le \tau M^{-\gamma} \le c_2\eta^\alpha.$$

This concludes this part.

5.1.3 Final lower bound

Now, we can deal with the lower bound (5.8). The proof is based on classical tools which can
be found for instance in [29], [23], [8] or [26]. First remark that the shape of $G^*_K$ depends on the
value of $\omega$. For the sake of convenience, we omit the dependency with respect to this quantity.
For all $\omega \in \{0,1\}^M$, recall that

$$G^*_K = \{x \in [0,1]^2 : f_\omega(x) \ge g_0(x)\} = \{x \in [0,1]^2 : 0 \le x_2 \le b(x_1,\omega)\}.$$

Using Assouad's Lemma and classical tools designed for instance in [29], we get



$$\sup_{\omega\in\{0,1\}^M} \mathbb E_{f_\omega}\left\{d_\Delta(\hat G_{n,m}, G^*_K) \,\middle|\, Y_1,\dots,Y_m\right\} \ge \frac{M}{2}\|\phi_1\|_1 \int \min\left[dP_{11}, dP_{10}\right], \qquad (5.9)$$

where $P_{11}$ denotes the law of $(Z^{(1)}_i)_{i=1\dots n}$ when the density of the $X^{(1)}_i$ is $f_{\omega_{11}}$. In the following,
we will choose $M$ in order to guarantee that the term $\int \min[dP_{11}, dP_{10}]$ is bounded from below.
Consequently, the lower bound will be determined by the corresponding value of $M\|\phi_1\|_1$. Since
the observations are independent

$$\int \min\left[dP_{11}, dP_{10}\right] \ge 1 - \sqrt{\left(1+\chi^2(P_1,P_0)\right)^n - 1},$$

where $\chi^2(P_a,P_b)$ denotes the chi-square divergence between two given probability measures $P_a$
and $P_b$, and $P_0, P_1$ are the laws of the variable $Z^{(1)}_i = X^{(1)}_i + \epsilon^{(1)}_i$ when the density of the $X^{(1)}_i$
is respectively $f_{\omega_{11}}$ or $f_{\omega_{10}}$. In the following, our aim is to find a satisfying upper bound for
$\chi^2(P_1,P_0)$.
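The role of the chi-square control can be illustrated numerically. Assuming the bound reads $\int\min[dP_{11},dP_{10}] \ge 1-\sqrt{(1+\chi^2)^n-1}$, a divergence of order $c/n$ keeps the right-hand side above $1-\sqrt{e^c-1}$ for every $n$, because $(1+c/n)^n \le e^c$. The constant `c` below is a hypothetical value of ours, taken below $\log 2$ so that the limit is positive.

```python
import math

def affinity_lower_bound(chi2, n):
    """Lower bound 1 - sqrt((1 + chi2)^n - 1) on the affinity of two n-fold products."""
    return 1.0 - math.sqrt((1.0 + chi2) ** n - 1.0)

c = 0.5  # hypothetical constant in chi2 <= c / n; needs c < log(2)
limit = 1.0 - math.sqrt(math.exp(c) - 1.0)

for n in (10, 100, 1000, 10_000):
    # With chi2 = c / n the bound stays above the n-free limit.
    print(n, affinity_lower_bound(c / n, n), limit)
```

This is why the number of perturbations is tuned so that the divergence decays exactly like $1/n$: a slower decay would make the bound useless, while a faster one would waste resolution.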

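First, remark that we can find $c > 0$ such that for all $x \notin [0,1]^2$ and all $\omega \in \{0,1\}^M$,
$f_\omega(x) \ge c f_1(x)$. Hence, using simple algebra, we get that

$$f_\omega * \eta(x) \ge \frac{C}{(1+x_1^2)(1+x_2^2)}, \quad \forall x \in \mathbb R^2, \qquad (5.10)$$

for some $C > 0$. In the following, given $f$, $\eta_1$ and $\eta_2$, we denote by $f * \eta$ the convolution product
in dimension two, i.e.

$$f * \eta(x) = \int_{\mathbb R}\int_{\mathbb R} f(x_1-y_1, x_2-y_2)\,\eta_1(y_1)\eta_2(y_2)\,dy_1dy_2, \quad \forall x \in \mathbb R^2.$$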

Then, using (JO]) and (pUOj) . 

Jr Jr Mi * 7?W 



< 



C / [(l + xj)(l + x 2 2 ){ Pl * V (x)} 2 dx. 
Jr Jr 



Hence

$$\chi^2(P_1,P_0) \le C\int_{\mathbb R}\int_{\mathbb R}\{\rho_1*\eta(x)\}^2dx + C\int_{\mathbb R}\int_{\mathbb R} x_2^2\{\rho_1*\eta(x)\}^2dx + C\int_{\mathbb R}\int_{\mathbb R} x_1^2\{\rho_1*\eta(x)\}^2dx + C\int_{\mathbb R}\int_{\mathbb R} x_1^2x_2^2\{\rho_1*\eta(x)\}^2dx := A_1 + A_2 + A_3 + A_4,$$

where the $\rho_j$ are defined in (5.5). In the following, we only consider the bound of $A_1$, the other
terms being controlled in the same way. We get

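$$A_1 = C\int_{\mathbb R}\int_{\mathbb R}\{\rho_1*\eta(x)\}^2dx$$
$$= CM^{-2\gamma/\alpha}\int_{\mathbb R}\int_{\mathbb R}\left\{\int_{\mathbb R}\int_{\mathbb R}\rho_{(2)}(x_2-y_2)\rho_{1,(1)}(x_1-y_1)\eta_1(y_1)\eta_2(y_2)dy_1dy_2\right\}^2dx$$
$$= CM^{-2\gamma/\alpha}\int\int \left|\mathcal F[\rho_{(2)}](t_2)\right|^2\left|\mathcal F[\rho_{1,(1)}](t_1)\right|^2\left|\mathcal F[\eta_1](t_1)\right|^2\left|\mathcal F[\eta_2](t_2)\right|^2dt_1dt_2$$
$$= CM^{-2\gamma/\alpha}A_{1,1}A_{1,2},$$

where

$$A_{1,1} = \int \left|\mathcal F[\rho_{1,(1)}](t_1)\right|^2\left|\mathcal F[\eta_1](t_1)\right|^2dt_1, \qquad A_{1,2} = \int \left|\mathcal F[\rho_{(2)}](t_2)\right|^2\left|\mathcal F[\eta_2](t_2)\right|^2dt_2,$$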



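and $\rho_{(2)}$, $\rho_{1,(1)}$ are respectively defined in (5.3), (5.4). We first deal with the term $A_{1,2}$. Using
simple algebra, we get

$$A_{1,2} = \int \left|\mathcal F[\rho_{(2)}](t_1)\right|^2\left|\mathcal F[\eta_2](t_1)\right|^2dt_1 = (3\pi^{-1}\tau M^{-\gamma})^2\int \left|\mathcal F[\rho](3\pi^{-1}\tau M^{-\gamma}t_1 \pm 2)\right|^2\left|\mathcal F[\eta_2](t_1)\right|^2dt_1.$$

Then, setting $s_1 = 3\pi^{-1}\tau M^{-\gamma}t_1$ and using the noise assumption, we obtain

$$A_{1,2} = 3\pi^{-1}\tau M^{-\gamma}\int \left|\mathcal F[\rho](s_1 \pm 2)\right|^2\left|\mathcal F[\eta_2]\left(\frac{s_1}{3\pi^{-1}\tau M^{-\gamma}}\right)\right|^2ds_1 \le CM^{-\gamma-2\gamma\beta_2}\int \left|\mathcal F[\rho](s_1 \pm 2)\right|^2|s_1|^{-2\beta_2}ds_1.$$

Using a similar algebra for the term $A_{1,1}$, we obtain

$$A_{1,1} \le CM^{-1-2\beta_1}.$$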

Similar bounds are available for A 2 ,A% and A^ since T\p\ and its weak derivative are bounded 
by 1 and supported on [—1; 1]. In particular, we use the fact that for all t 6 R 

T[p W) ](t) = 3vr- 1 rAf- 7 ^[p](3^~ 1 rM^t ± 2), 



and 



j?[PU2)]{t) = -i{^- l rM-') 2 t.F[p\{^- 1 TM-n±2), 
for all t in a subset of R having a Lebesgue measure equal to 1 . 

The above equations lead to the following upper bound:

$$\chi^2(P_1,P_0) \le CM^{-\gamma(2/\alpha+1)-2\beta_1-2\beta_2\gamma-1}.$$

Then, $\chi^2(P_1,P_0) \le C/n$ for some constant $C > 0$ as soon as

$$M = M_n \simeq n^{\frac{1}{\gamma(2/\alpha+1)+2\beta_1+2\beta_2\gamma+1}}.$$

Finally, going back to equation (5.9), we obtain



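$$\sup_{\omega\in\{0,1\}^M}\mathbb E_{f_\omega}\left\{d_\Delta(\hat G_{n,m}, G^*_K) \,\middle|\, Y_1,\dots,Y_m\right\} \ge \frac{M_n}{2}\|\phi_1\|_1\int \min\left[dP_{11},dP_{10}\right] \ge CM_n\|\phi_1\|_1 = C\tau M_n^{-\gamma}\int\varphi(t)dt \simeq n^{-\frac{\gamma}{\gamma(2/\alpha+1)+2\beta_1+2\beta_2\gamma+1}},$$

which concludes the proof.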
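The calibration of $M_n$ in dimension $d = 2$ can be retraced with a few lines of arithmetic. The sketch assumes the chi-square bound has order $M^{-\kappa}$ with $\kappa = \gamma(2/\alpha+1)+2\beta_1+2\beta_2\gamma+1$ and that the lower bound has order $M^{-\gamma}$, as in the displays above; the function name and the parameter values are ours.

```python
import math

def boundary_fragment_lower_rate(n, alpha, gamma, beta1, beta2):
    """Return (M_n, order of the chi-square term, order of the lower bound) for d = 2."""
    kappa = gamma * (2.0 / alpha + 1.0) + 2.0 * beta1 + 2.0 * beta2 * gamma + 1.0
    m_n = n ** (1.0 / kappa)        # number of bumps on the boundary
    chi2_order = m_n ** (-kappa)    # equals 1 / n by construction
    rate = m_n ** (-gamma)          # order of M_n * ||phi_1||_1
    return m_n, chi2_order, rate

# Hypothetical parameter values, for illustration only.
n, alpha, gamma, beta1, beta2 = 5000, 1.5, 2.0, 1.0, 0.75
m_n, chi2_order, rate = boundary_fragment_lower_rate(n, alpha, gamma, beta1, beta2)

# Same exponent, written with alpha cleared from the denominator.
exponent = gamma * alpha / (
    gamma * (2 + alpha) + alpha + 2 * alpha * beta1 + 2 * alpha * gamma * beta2
)
print(m_n, chi2_order, rate, n ** (-exponent))
```

Multiplying numerator and denominator by $\alpha$ shows that the exponent $\gamma/\kappa$ is the $d = 2$ instance of $\gamma\alpha/(\gamma(2+\alpha)+(d-1)\alpha+2\alpha\beta_1+2\alpha\gamma\beta_2)$.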

5.2 Proof of Theorem 2

The proof mixes standard lower bound arguments coming from classification (see [1] and [2]) but
then uses some techniques which are specific to the inverse problem literature (see for instance
[8] or [26]).

Consider $\mathcal F_2 = \{f_{\vec\sigma},\ \vec\sigma = (\sigma_1,\dots,\sigma_k) \in \{0,+1\}^k\}$ a finite class of densities with respect to
a specific measure $Q_0$ and $g_0$ a fixed density (with respect to the same $Q_0$) such that $(f_{\vec\sigma}, g_0) \in
\mathcal F_{plug}$ for all $\vec\sigma \in \{-1,+1\}^k$. The construction of $f_{\vec\sigma}$ as a function of $\vec\sigma$, the value of $g_0$ and the
definition of $Q_0$ will be made precise in Section 5.2.1. Then, for every estimator $\hat G_{n,m}$ of the set $G^*_K$,
we have:



sup ^f,gdA(G n)7n ,G* K ) > sup E go 



E 



n,m ? 



G 



(5.11) 



First, we propose a triplet $(\mathcal F_2, g_0, Q_0)$. Then we prove that each associated element
satisfies our hypotheses. We finish the proof with a convenient lower bound for (5.11).

5.2.1 Construction of the triplet $(\mathcal F_2, g_0, Q_0)$

We only consider the case $d = 2$ for simplicity, whereas straightforward modifications lead to
the general $d$-dimensional case. For $g_0$, we take the constant 1 over $\mathbb R^d$:

$$g_0(x) = 1, \quad \forall x \in \mathbb R^d.$$

For any $z \in \mathbb R^d$ and positive $\delta$, we write in the sequel $B(z,\delta) := \{x = (x_1,\dots,x_d) : |x_i - z_i| \le \delta\}$.
For an integer $q \ge 1$, introduce the regular grid on $\mathbb R^d$ defined as:

$$G_q = \left\{\left(\frac{2p_1+1}{2q},\dots,\frac{2p_d+1}{2q}\right),\ p_i \in \{0,\dots,q-1\},\ i = 1,\dots,d\right\}.$$



Let $n_q(x) \in G_q$ be the closest point to $x \in \mathbb R^d$ among points in $G_q$ (by convention, we choose the
closest point to 0 when it is non unique). Consider the partition $(\chi'_j)_{j=1,\dots,q^d}$ of $[0,1]^d$ defined as
follows: $x$ and $y$ belong to the same subset if and only if $n_q(x) = n_q(y)$. Fix an integer $k \le q^d$.
For any $i \in \{1,\dots,k\}$, we define $\chi_i = \chi'_i$ and $\chi_0 = \mathbb R^d \setminus \cup_{i=1}^k \chi_i$ to get $(\chi_i)_{i=0,\dots,k}$ a partition of
$\mathbb R^d$.

Then, we consider the measure $Q_0$ defined as $dQ_0(x) = \mu(x)dx$ where $\mu(x) = \mu_0(x) + \mu_1(x)$
for all $x \in \mathbb R^2$ with

$$\mu_0(x) = k\omega\,\rho(x_1 - 1/2)\rho(x_2 - 1/2) \quad \text{and} \quad \mu_1(x) = (1-k\omega)\,\rho(x_1 - a)\rho(x_2 - b),$$

where $k, \omega, a, b$ are constants which will be made precise later on and where $\rho : \mathbb R \to [0,1]$
is the function defined in the previous lower bound as

$$\rho(x) = \frac{1-\cos(x)}{\pi x^2}, \quad \forall x \in \mathbb R.$$

It seems clear that the function $g_0$ defines a probability density w.r.t. the measure $Q_0$ since
$\int_{\mathbb R^2}\mu(x)dx = 1$.

Now, we have to define the class $\mathcal F_2 = \{f_{\vec\sigma},\ \vec\sigma\}$. Denote by $(z_j)_{j=1,\dots,k}$ the centers of the $\chi_j$'s.
We first introduce $\varphi$ as a $C^\infty$ probability density function w.r.t. the measure $Q_0$ and such that

$$\varphi(x) = 1 - c^*q^{-\gamma}, \quad \forall x \in [0,1]^2.$$

Now introduce a class of functions $\psi_j : \mathbb R^2 \to \mathbb R$, for $j = 1,\dots,k$, defined for any $x \in \mathbb R^2$ as follows:

$$\psi_j(x) = q^{-\gamma}c_\psi\,\rho(2\pi q(x_1 - z^j_1))\,\rho(2\pi q(x_2 - z^j_2))\cos(4\pi q(x_1 - z^j_1))\cos(4\pi q(x_2 - z^j_2)).$$

The class $(\psi_j)_j$ is specific to the noisy case and the inverse problem literature (see [8] and
[26]), and mimics the construction provided in Theorem 1. Recall that $\rho$ satisfies $\mathcal F[\rho](t) =
(1-|t|)_+$, which will allow us to take advantage of the regularity assumption on $\eta$ in the noise
assumption.

With such notations, for any $\vec\sigma \in \{0,1\}^k$, we define:

$$f_{\vec\sigma}(x) = \varphi(x) + \sum_{l=1}^k \sigma_l\psi_l(x), \quad \forall x \in \mathbb R^2.$$

Now we have to check that this choice of $\mathcal F_2$, $g_0$ and $Q_0$ provides the margin assumption and
that the complexity assumption holds true.

5.2.2 Properties of the triplet $(\mathcal F_2, g_0, Q_0)$

First, we prove that the $f_{\vec\sigma}$ define probability density functions w.r.t. the measure $Q_0$.
Let $\vec\sigma \in \{0,1\}^k$. Remark that, considering the case $d = 1$ w.l.o.g.:

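$$\int_{\mathbb R}\psi_1(x)\mu_0(x)dx = \mathcal F[\psi_1\mu_0](0) = c_\psi q^{-\gamma}\mathcal F[\rho(2\pi q\,\cdot)\mu_0(\cdot)](\pm 4\pi q) = c_\psi q^{-\gamma}k\omega\,\mathcal F[\rho] * \mathcal F[\rho(2\pi q\,\cdot)](\pm 4\pi q).$$

Then, since

$$\mathcal F[\rho](t) \neq 0 \ \Leftrightarrow\ -1 < t < 1$$

and

$$\mathcal F[\rho(2\pi q\,\cdot)](t) \neq 0 \ \Leftrightarrow\ -1 < \frac{t}{2\pi q} < 1 \ \Leftrightarrow\ -2\pi q < t < 2\pi q,$$

we get

$$\mathrm{supp}\,\mathcal F[\rho] * \mathcal F[\rho(2\pi q\,\cdot)] = [-2\pi q - 1;\ 2\pi q + 1] \quad \text{and} \quad \int_{\mathbb R}\psi_1(x)\mu_0(x)dx = 0. \qquad (5.12)$$

This proves the desired result.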
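The vanishing integral in (5.12) can be checked numerically in dimension one. The sketch below takes $z = 1/2$, drops the constant factors $c_\psi q^{-\gamma}k\omega$, and compares the integral of $\rho(2\pi q u)\rho(u)$ with and without the modulation $\cos(4\pi q u)$, where $u = x - 1/2$. The truncation width and step are arbitrary choices of ours.

```python
import math

def rho(x):
    """rho(x) = (1 - cos x) / (pi x^2), extended by its limit 1/(2 pi) at x = 0."""
    if abs(x) < 1e-8:
        return 1.0 / (2.0 * math.pi)
    return (1.0 - math.cos(x)) / (math.pi * x * x)

def integral_psi_mu0(q, modulated=True, half_width=50.0, step=0.01):
    """Midpoint rule for the integral of rho(2 pi q u) rho(u) [cos(4 pi q u)] du."""
    n_cells = int(2 * half_width / step)
    total = 0.0
    for k in range(n_cells):
        u = -half_width + (k + 0.5) * step   # u = x - 1/2
        value = rho(2 * math.pi * q * u) * rho(u)
        if modulated:
            value *= math.cos(4 * math.pi * q * u)
        total += value * step
    return total

with_cosine = integral_psi_mu0(1, modulated=True)
without_cosine = integral_psi_mu0(1, modulated=False)
print(with_cosine, without_cosine)
```

The modulated integral is numerically negligible because $4\pi q > 2\pi q + 1$ for every integer $q \ge 1$, whereas without the cosine the same product has a clearly positive integral.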

Concerning the regularity, $f_{\vec\sigma} \in \Sigma(\gamma,L)$ for $q$ large enough since $f_{\vec\sigma}$ can be written as
$q^{-\gamma}F_0(x)$ where $F_0$ is infinitely differentiable.

In order to conclude this part, we only have to prove that the margin hypothesis is satisfied for
all the couples $(f_{\vec\sigma}, g)$. Concerning the parameters $k$ and $\omega$ we will use the following asymptotics:

$$k\omega = O(q^{-\alpha\gamma}), \qquad k = q^d, \qquad \omega = q^{-\alpha\gamma-d}.$$

Then, we distinguish two different cases concerning the possible value of $t$. The first
case concerns the situation where $C_1q^{-\gamma} < t < t_0$ for some constant $C_1$. Then, thanks to the
construction of $\mu_0$:



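$$\mu\left\{x \in [0,1]^d : |f_{\vec\sigma}(x) - g(x)| \le t\right\} \le \int_{[0,1]^d}\mu_0(x)dx \le k\omega \le Cq^{-\alpha\gamma} \le Ct^\alpha.$$

Now, we consider the case where $t \le C_1q^{-\gamma}$. We have, in dimension $d = 2$ for simplicity,
$\forall\vec\sigma \in \{0,1\}^k$: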



(f,~g)(x)\<tdx 



k 

Ho {x G [0, l] 2 : \(f a - g)(x)\ < t] = \ ku)luf a _ g - ) r x) \< t dx < ku V / 1 

< k 2 uLeb{x£ X i-\(fa-g)(x)\<t}, (5.13) 

where without loss of generality, we suppose that ax = 1. 

Last step is to control the Lebesgue measure of the set W\ = {x G xi '■ \(fa — d)( x )\ — t}. 
Since f a — g = Ylj=i a j' l l J j ~ C*q _1 , we have: 

k 

W x C {x G xi ■ I Y^^jix)\ < t} C {x G xi ■ < *} ■= W{, 

i=i 

noting that V/ ^ j, sigat/jj = sigm//-. We hence have to control the size of W[. The idea is to 
approximate tp x at each x G W[ by a Taylor polynomial of order 1 at z x := argmin^.^^wo \\x— z\\ 
as follows: 

ipx(x) = ipx(zx) + Vil>i(*%,4) ■ ( x - z x ) + 0(\\x - z\\). 
Hence we have by construction, since Vx G W{, there exists i G {1, 2} : X{ = zf : 

Leb(W{) < Leb{xexi--\Mzx) + ^i(4,4)-(x-z x )\<t} 
< cLeb {x G Xi '■ QQ^^i — %\ \ < t} 

~ C q 2 q~^ 

Gathering with (j5. 13fl . we hence get, for t < Cxq~ J , provided that a > 1: 
/i {xG [0,1] 2 : \{U-g){x)\ <t] < ck 2 uj 



< ckco— = a f f ( 1 - a h a i L - a < c't a . 
q-r 



5.2.3 Final lower bound

Suppose without loss of generality that $n \le m$. Now we argue as in [1] (Assouad Lemma for
classification) and introduce $\nu$, the distribution of a Bernoulli variable ($\nu(\sigma = 1) = \nu(\sigma = 0) =
1/2$). Then, denoting by $P^{\otimes n}_{\vec\sigma}$ the law of $(Z^{(1)}_1,\dots,Z^{(1)}_n)$ when $f = f_{\vec\sigma}$, we get

sup E f \d A (G n>m , G* K )\Z[ 2 \ . . . , Zg> } 

■^6{-l,+l} 1 J 
> E^kEfedAiGn^m, G* K ), 



> E„ m E u J2 [ 1 ( x e Gn, m AG* K )Q (dx), 

j=l J Xj 
k 

E / E K^) / l(^GG' n , m ( w )AG^)Q(dx)P^(^) 



> E, 



,®(fe-i) 



3=1 

k 

E 

3=1 



n®n p<g>n 



(5.14) 



where a,> = (<ti, cr,-_i, r, <7j+i . . . , a^) for r G {0,1} and £>j is defined above. Now introduce 
binary valued functions: 

f(x) = l(x € G„, m ) and /^(x) = l(x £ G* K ). 
Then since Yltej a li ) l{ x ) — ^j( x ) ~~ ( see 5.2.2), we have coarsely for any c^: 



Vx G X jJ^{x) = < 



<Tj for x £ Bj, 



(5.15) 



otherwise, 

where J3j ={i£ : Vi |xj — Zj^l < |}. Now we go back to the lower bound. We can write: 
E^) / l(x G G n , m (o;)AG^)Qo(dx) = E„ (dff .> / 1(/ / f^)Q Q (dx) 



> E 

1 

2 

1 



K^j) / 1 (/ 7^ crj)Qo(dx) 



[l(/>l) + l(/Vo)]Qo(dx) 



where we use (|5,15p at the second line. Then it follows from (|5.14p that: 

sup E f \d A (G n , m , G* K )\Z[ 2 \ . . . , zg>\ 
-^e{-i,+i}* 1 J 



> E 



„®(fe-l) 



E 

3=1 



p®n p®n 
°J,0 . °j,l 



(M\ f Q (dx)P^ 



UJ) 



= ^E^-vil-ViWf^^ f Qo(dx) 

3=1 B i 

> ^E^.^l- Jx 2 (Pf\,P| n }4 / Qo(dx) 
= ^[l-^TT^PT^F 3 !)^] / Qo(dx), 

3=1 ^ 



(5.16)

where $P_{\pm1}$ is the law of $Z^{(1)}_1$ when $f = f_{\vec\sigma}$ with $\vec\sigma = (\pm1, 1,\dots,1)$. Then we can write, if
$\chi^2(P_1,P_{-1}) \le C/n$:

$$\sup_{\vec\sigma\in\{-1,+1\}^k}\mathbb E_{f_{\vec\sigma},g}\,d_\Delta(\hat G_{n,m}, G^*_K) \ge c'\sum_{j=1}^k\int_{B_j}Q_0(dx) = c'k\omega, \qquad (5.17)$$

where we use the definition of $Q_0$.

Next step is to find a satisfying upper bound for x 2 (Pi; Po)- We have, by construction of 



I - iv,-l)M - - 



r [C/V.+i- U,-i)t^*v] 2 dx | r KU,+i - /g^-iW * *ff d 

y * ?? y 



The r.h.s. term can be considered as negligible with a good choice of the parameters $a$ and $b$.
Hence, we concentrate on the first one. First remark that for all $x \in \mathbb R^d$

*V> C YT^2 > and {Cfe,+i - H ,-iVo} * ?? = g _7 {^/^o} * »/(a;) = q^kutyip} * i](x). 

Hence 

X 2 (Pi,F-i) < Ckuq-*r\\(ih.p)*ri\\ 2 . (5.18) 

From the definition of ip\ and the conditions on 77, we have in dimension d = 2 for simplicity: 

2 



ip) * ??ll 2 = / ftM * ^) 2 ^ = f[ / Wip](*OI 2 I^M(*i)l 2 d<* 
2 /■ 

= [ / \F\p{2nq-) P ]{U - 47rg)| 2 \F[ m ]{U)\ 2 dt t . 



i=i 



Using (j5. 12|) . the noise assumption, and the fact that q — > +00, we get 



- 2 r 

= Cq- 2(3 11 I \F[p(2irq-)p](ti - Airq)] 2 dU 
i=i 

= Cq- 2 P\\p(2irq.)pf, 

< Cq- 2 %(2, q .)f<Cq-^-\ 



Using (5.18), one gets the following control of the quantity $\chi^2(P_1,P_{-1})$ in the general $d$-dimensional
case:

$$\chi^2(P_1,P_{-1}) \le Cq^{-2\gamma-\alpha\gamma-d-2\sum_{i=1}^d\beta_i} \le \frac{C}{n} \quad \text{with} \quad q = n^{\frac{1}{2\gamma+\alpha\gamma+d+2\sum_{i=1}^d\beta_i}}. \qquad (5.19)$$

Now using (5.17),

$$\sup_{\vec\sigma\in\{-1,+1\}^k}\mathbb E_{f_{\vec\sigma},g}\,d_\Delta(\hat G_{n,m}, G^*_K) \ge c'k\omega = c'q^{-\alpha\gamma} = c'n^{-\frac{\alpha\gamma}{\gamma(2+\alpha)+d+2\sum_{i=1}^d\beta_i}},$$

which concludes the proof of the second lower bound.

5.3 Proof of Theorem 3

The proof is presented for $d = 2$ for simplicity, whereas straightforward modifications lead to the
$d$-dimensional case. In the sequel, we identify each $\nu \in \Sigma(\gamma_p, L)$ with a set $G_\nu = \{x : \nu(x) \ge 0\}$.
In the same way, we identify $G^*_K$ with $\nu^* = f - g$.

For all $G_\nu := \{\nu \ge 0\}$, we have, using the notations of Section 3:

Rn,m{Gv) — Rn,m{G*K) — R^(G U ) + R\(G*k) 
n n 



where, for all i £ {1, . . . , n}, 



i=l 



2m 



i=l 



r(l), 



and 



Then, for all i G {1, . . . , n}, using Lemmas [3] and 4 in Appendix we get 

E[[/,(G)] 2 < cX^\^d A {G,G* K )<dX^\- m d f , g {G,G* K )^-\ 

and 

2 

|f/,(G)|<CnA. ft " 1/2 , 
1=1 

for some constant C > 0. The Bernstein's inequality leads to 



P 



1 n 



i=i 



> a I < 2exp 



Cna 2 



for all a > 0. Since > 1/2 for all i 6 {1, . . . , d}, the particular choice a = df tg (G u , G* K ) yields 



P 



71 



i=l 



> df, g (G,G* K ) I < 2exp -CnA^A^d^CG.G^f-^i 



\2- 



2 exp 



CnAf 1 A^ 2 d /iS (G,G^)^ 



Using the same algebra on the V^Gj,), we get 

P{\T n (G u )\ > d Lg (G u ,G* K )) < 2exp [-Cn\f x Xf 2 d f , g (G, G* K )%* 

This concludes the first part of the proof. Let $t$ be a positive parameter which will be chosen later
and introduce the set $\mathcal G'$ defined as

$$\mathcal G' = \left\{G \in \mathcal N_{\delta_n},\ d_{f,g}(G^*_K, G) \ge t\delta_n^{1+\alpha}\right\},$$

where $\mathcal N_{\delta_n}$ is the $\delta_n$-network introduced in Section 3.2. Using the upper bound above,

p(3Geg':\T n (G)\>±d ftg (G,G* K )\ < £ P (\T n (G)\ > \d f , g (G, G* K )Y 

Geg' ^ 



< 2ex P - CnX T lx T 2d fA G i G *K)^ 



Geg' 



< 



^2exp -CnXf'Xfi {t8l +a )^ 



Gag 1 



< 2exp -CnXl Pl Xf 2 t^5l +a 

Geg' 
Since $\log\mathrm{card}(\mathcal N_{\delta_n}) = A\delta_n^{-2/\gamma}$, we get



p(3GeG' : \T n (G)\ > \d f>g {G, G* K ) ) < exp AS~ 2 ^ - CnXf 1 xf 2 5 2 + c 



Thanks to the value of S r , 



Hence 











< exp 




= exp 









-2/7. 



2/-Y + 2+a 



(5.20) 



Now, using Lemma [5] in Appendix, we can find a set G n G A/"<5 n such that: 

d ft9 (G* K ,G n ) < \\u* - v n W£ l < CoC a , 
for some positive constant Cq. Then, for all G G we get 

\d f , g (G,G* K ) - l -d Lg {G n ,G* K ) > \5 l n +a - |#+« > ^5l +a , 

provided that t > 4Cq. We eventually obtain: 

p(d f , g (G* K ,G n )>t6 1 + a ] 
< P (3G G g' such that R n {G) < R n {G n )) , 

= p(3Ge Q' such that \d\ g {G, G* K ) + Z n (G) - \d\ g {( 
where for all Gi,G 2 C K, dh (Gi,G 2 ) is defined as 



1 
2 



df >g (Gi,G2) — R\{Gi) — i?^(G 2 ) 



Last step is to control the bias term. For all Gi,G 2 C K, we can remark that 
{R X K - R K ){Gt - G 2 



< 




2 

z — x 



X 



+ 




1 



f{z)dQ{z) - f{x) 
g(z)dQ(z) -g(x) 



< 



X \ X 
|/Ca * v(x) — nu{x)\ dQ(x), 



[l(x G Gf ) - l(x G G$)] dQ(x] 
[l(x G G\) — l(x G G2)} dQ(x) 



'G1AG2 

< \\tCx * v - ^||oodA(Gi,G 2 ), 

< Cd A (G 1 ,G 2 )[A7 + A]], 

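for some $C > 0$, provided that, for $\nu \in \Sigma(\gamma, L)$ and $\mathcal K$ a kernel of order $l = \lfloor\gamma\rfloor$:

$$\|\mathcal K_\lambda * \nu - \nu\|_\infty \le C\left[\lambda_1^\gamma + \lambda_2^\gamma\right].$$

Using the Young inequality

$$xy^r \le ry + (1-r)x^{1/(1-r)},$$

with $r = \alpha/(\alpha+1)$, we get for all $G_1, G_2 \subset K$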

7(1+0) 

- - G 2 ) < (1 - r) 7 1 /( 1 " r) [A? + A§] 5 + j-V r d f>g (G u G 2 ), 
(5.22)



(5.23) 



for some 7 > 0. Hence, it follows from (|5.2Q[) - ()5.21j) that 
p[d f , g (G* K ,G n )>t8 1 n +a ) 

< P f 3G € Q' such that \d Lg {G, G* K ) + Z n (G) - \d f , g (G n , G* K ) - Z n (G n ) + C J2 ^ °) 

< p(3G£ Q> such that Z n {G) < -\d f , g {G, G* K )j + P [z n {G n ) > C{8]+ a + £ A^ (1+q) )^ . 
In order to conclude, remark that the proposed choice of (Xj)j=i...d provides 

r = £ aj (1+<,) »* EM , (^) ^ = [a? + 

The end of the proof follows exactly the same lines as in [2].

5.4 Proof of Theorem 4

Let us prove the first assertion. Using the definition of G^ in (|1.7p . we have: 



d f>g (G*,G* K ) < R K {G*)-R K {G* K )-R*{G*) + R*{G* K ) 

< R^{Gn) - Rn(Gn) + Rn{G*K) ~ R X (G* K ) + (R K ~ Rx)(Gn ~ G * K ) 

Consider the empirical processes Vn \ for j G {1,2}, defined as: 

1 n 

i=i 

Hence we can write: 



(5.24) 



J if- g)(Kx * l{. eG ^} - £a * l { . e G n} ) 



< ^(»£HG* C ) - ,«(G^)) + ^=(4 2) (G*) - ^(G*). 

Now denoting A = Ilf =l X i * 2 , c(A) = nf =1 A~^ and p = 2/7, consider the event: 

= {d A (G n ,G^) > c(A)~^n"TT>5A^}. 
We have on Q, using both the margin assumption and Lemma [3) 



J {f- 9){fCx * 1{. G G^} " * l { .gG n} ) 



1-p 



< 



d A 2 (G* G*)c(A) 



Vn\G* c ) - Vn\G ) £L) 



c{\)d/ (G$ m , G*) V c(A) d+p) n "^A^ 



+ 



^ 2) (G*)-^ 2) (G^ m ) 



c(A)d A 2 " (G* m , G*) V c(A) d+rin"^A^ 



1 — P Q 

2 a+1 /AA 



< df ' 9 a+1{G T G * )c{x \ v^ + v(% 
where $V_n^{(j)}$ is the random variable defined as, for $j \in \{1, 2\}$:



sup 



Vn\G*) - v^\g)\ 



2p 1-p 1-p • 



Geg c(\)p\\h^ - h^4 l ~P (j) Vc(A)< 1 +")n 2+^A^+7 



(5.25) 



A generalization of Lemma 5.13 of [16] provides that the variable $V_n^{(j)}$ has controlled
moments. We can write, using Young's inequality $xy^r \le ry + (1-r)x^{1/(1-r)}$ for



/(/ - g){K x * l le6n} - K x * l { . eG » a ) < c {^[VP + K (2) ]) 
Finally we conclude that on $\Omega$:



2(a+l) 



l— p a . 
2 a+1 • 



(5.26) 



d f ,M,G* K ) < C (^[^ (1) + K (2) ]) 



2(a+l) 



+ sup 

GeQ(rt,L) 



Rk-Rk (G). 



This allows us to get the first assertion of the Theorem since, on $\Omega^c$, we have from an
easy calculation:

2(q + l) 

c(A)\ "+2+P" 



^a(G, (Tr- ) < c(A) l +pn 1 +pA 1 +p=c(\) 1 ~pn i-p < 

provided that $\lambda = (\lambda_1, \dots, \lambda_d)$ is chosen small enough to ensure the last inequality.
To get the second assertion, we apply Lemma 1 with $G_1 = \hat G_{n,m}$ and $G_2 = G_K^*$ and
Lemma 4 in order to get



dA(G n , m ,G* K ) < C\\K\ * lr eG — K\ * 1{.sg* }||i + ^i + M 



+ A? + A 2 . (5.27) 



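Since $\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G\}} \in [0;1]$ for any $G \in \mathcal{G}(\gamma, L)$, we get
\[
\begin{aligned}
\int |f-g|\,\left|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_K^*\}} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in \hat G_n\}}\right|
&\le \int |f-g|\,\left|\mathbf{1}_{G_K^*} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in \hat G_n\}}\right| + \int |f-g|\,\left|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_K^*\}} - \mathbf{1}_{G_K^*}\right| \\
&= \int (f-g)\left(\mathbf{1}_{G_K^*} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in \hat G_n\}}\right) + \int |f-g|\,\left|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_K^*\}} - \mathbf{1}_{G_K^*}\right| \\
&\le \int (f-g)\left(\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_K^*\}} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in \hat G_n\}}\right) + 2\int |f-g|\,\left|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_K^*\}} - \mathbf{1}_{G_K^*}\right|,
\end{aligned}
\]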

which gives, gathering with (5.27):

dA\G nim , G*k) 



< C [R x K {G x n ) - R X K (G* K ) + 2 J |/ - g\\l G * K - K x * l { . eG|J 
Finally, using Lemma 1, (5.28) and (5.26), we have on $\Omega$:

d A (G,G* K )<c( C %V^+V^ X 



a/a+l 



+ C{XJ + A 2 )(5.28) 



]) + E AJ + X d + (/ 1/ - dllGfr - ^|) 



q/q+1 



Integrating the above inequality, we conclude the proof noting that on $\Omega^c$, we have from an
easy calculation:

d A (G,G* K ) < c (A)"^n"^A^ = (uf =1 X in j~ p < n - T ^ a '^\ 
provided that $2\beta_1 + 2\beta_2\gamma + 1 > \gamma$, or in particular when $\beta_2 > \frac{1}{2}$.

6 Appendix



For the sake of convenience, we assume throughout this section that the kernels $(\mathcal{K}_j)_{j=1,\dots,d}$ are
compactly supported. This assumption can easily be relaxed at the cost of more complicated algebra.



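Lemma 1 For any $G_1, G_2 \in \mathcal{G}(\gamma, L)$, and any $\lambda > 0$, we have:
\[
d_\Delta(G_1, G_2) \le c\,\|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\|_1 + \left(\sum_{i=1}^{d-1} \lambda_i^2\right)^{\gamma/2} + \lambda_d .
\]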



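Proof. For the sake of convenience, we only give the proof in the particular case where $d = 2$.
Using the equality $|a - b| = a + b - 2\min(a, b)$, $\forall a, b \in \mathbb{R}$, we can write:
\[
\|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\|_1
= \int \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}} + \int \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}} - 2\int \min\left(\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}}, \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\right).
\]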

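Then for any $G \in \mathcal{G}(\gamma, L)$, remark that
\[
\begin{aligned}
\left|\int_{[0,1]} \mathcal{K}_\lambda * \mathbf{1}_{\{x \in G\}}\,dx_2 - b(x_1)\right|
&= \left|\int_{[0,1]}\int_{\mathbb{R}^2} \frac{1}{\lambda_1\lambda_2}\mathcal{K}\left(\frac{x-u}{\lambda}\right)\mathbf{1}(u_2 \le b(u_1))\,du\,dx_2 - b(x_1)\right| \\
&\le \left|\int_{[0,1]}\int_{\mathbb{R}} \frac{1}{\lambda_1}\mathcal{K}_1\left(\frac{x_1-u_1}{\lambda_1}\right)\mathbf{1}(x_2 \le b(u_1))\,du_1\,dx_2 - b(x_1)\right| + C\lambda_2 \\
&= \left|\int_{\mathbb{R}} \frac{1}{\lambda_1}\mathcal{K}_1\left(\frac{x_1-u_1}{\lambda_1}\right) b(u_1)\,du_1 - b(x_1)\right| + C\lambda_2 \\
&\le C(\lambda_1^\gamma + \lambda_2).
\end{aligned}
\tag{6.1}
\]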



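Moreover, noticing that $\int \min(f, g) \le \min\left(\int f, \int g\right)$, we have, using (6.1):
\[
\begin{aligned}
\int \min\left(\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}}, \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\right)
&\le \int_{[0,1]} \min\left(\int_{[0,1]} \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}}\,dx_2, \int_{[0,1]} \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\,dx_2\right) dx_1 \\
&\le \int_0^1 \min\left(b_1(x_1), b_2(x_1)\right) dx_1 + C(\lambda_1^\gamma + \lambda_2).
\end{aligned}
\]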

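Finally we arrive at the conclusion:
\[
\begin{aligned}
d_\Delta(G_1, G_2) = \int |b_1 - b_2|
&= \int b_1 + \int b_2 - 2\int \min(b_1, b_2) \\
&\le \int b_1 + \int b_2 - 2\int \min\left(\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}}, \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\right) + 2c'\left[\lambda_1^\gamma + \lambda_2\right] \\
&\le \|\mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_1\}} - \mathcal{K}_\lambda * \mathbf{1}_{\{\cdot \in G_2\}}\|_1 + 2C\left[\lambda_1^\gamma + \lambda_2\right],
\end{aligned}
\]
for some positive constant $C$.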



□ 
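The two elementary facts driving this proof can be checked numerically. The sketch below assumes boundary fragments of the form $G_i = \{(x_1, x_2) : 0 \le x_2 \le b_i(x_1)\}$ on the unit square; the boundary functions `b1`, `b2` and the helper `trapezoid` are hypothetical choices made for illustration only.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Two illustrative boundary functions with values in [0, 1].
x1 = np.linspace(0.0, 1.0, 20001)
b1 = 0.5 + 0.2 * np.sin(2.0 * np.pi * x1)
b2 = 0.4 + 0.3 * x1 ** 2

# d_Delta(G1, G2) is the area between the two boundaries.
d_delta = trapezoid(np.abs(b1 - b2), x1)

# Identity |a - b| = a + b - 2 min(a, b), integrated over x1.
rhs = trapezoid(b1, x1) + trapezoid(b2, x1) - 2.0 * trapezoid(np.minimum(b1, b2), x1)

# Inequality: int min(f, g) <= min(int f, int g).
int_min = trapezoid(np.minimum(b1, b2), x1)
min_int = min(trapezoid(b1, x1), trapezoid(b2, x1))

print(d_delta, rhs, int_min, min_int)
```

The first two printed values agree up to rounding, and the third does not exceed the fourth, as used in the last two displays of the proof.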



Lemma 2 For any $(f, g)$ satisfying the margin assumption with parameter $\alpha > 0$, we have:
\[
d_{f,g}(G_\nu, G_K^*) \lesssim \|\nu - \nu^*\|_\infty^{\alpha+1},
\]
where $G_\nu = \{\nu \ge 0\}$ and $\nu^* = f - g$.






The proof is a straightforward modification of the proof of Lemma 5.1 in [2], which states a
similar result in the binary classification framework.

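Lemma 3 Assume that $\eta$ satisfies the Noise assumption and let $\mathcal{K}_\eta$ be the deconvolution kernel.
Then we have,

(i) $\mathbb{E}\left[h_{G,\lambda}(Z) - h_{G',\lambda}(Z)\right]^2 \lesssim d_\Delta(G, G') \prod_{i=1}^d \lambda_i^{-2\beta_i}$,

(ii) $\sup_x \left|h_{G,\lambda}(x) - h_{G',\lambda}(x)\right| \lesssim \prod_{i=1}^d \lambda_i^{-\beta_i - 1/2}$.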



Proof. For the sake of convenience, we only consider the case where $d = 1$. We first prove (i).
We have:



E[h G>x {Z) - h G , >x (Z)}' 



A 



z — X 



1 2 



(!{:eGG} - l{xGG'}) 1 {a;eA'}^Q(a ; ) 



< c/ -^|J-[^(./A)](t)| 2 |^x (l { . eG} -l { . eG , } )l { . eA , } ](i)| 2 ^, 
Indeed, for all s G
1 
|J-[/C,(./A)]( s )|^ = |^[/C,]( s A)|^<sup



J"[/C](tA) 



< sup 



provided that $\mathcal{K}$ has a compactly supported Fourier transform. In the same way,

1 



sup\h GtX (x) - h G > t x(x) 



sup 

xGRJGAG' a 



K. r 



z — x 



A 



dQ(x) 



< CA" 2/3 , (6.2) 
A 2 " V A



2 — x 



dx < A"^ 1 / 2 , 



where the last line is inspired by (6.2).



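□

The following Lemma proposes a generalization to the well-known inequality of

Lemma 4 Let $h$ be a positive and bounded function, integrable with respect to $Q$, with $\sup_x h(x) \le B$.
Suppose the margin assumption holds and denote by $\alpha > 0$ the margin parameter. Then, there
exist positive constants $c(\alpha)$ and $C(\alpha)$ such that:
\[
c(\alpha)\left(\int h(x)\,dQ(x)\right)^{\frac{\alpha+1}{\alpha}} \le \int |f - g|\,h(x)\,dQ(x) \le C(\alpha)\int h(x)\,dQ(x).
\]
In particular, for all $G_1, G_2 \subset K$, we have
\[
c(\alpha)\left(d_\Delta(G_1, G_2)\right)^{\frac{\alpha+1}{\alpha}} \le d_{f,g}(G_1, G_2) \le C(\alpha)\,d_\Delta(G_1, G_2).
\]

Proof. The proof follows exactly the proof of Lemma 2 of [23]. Since $Q(K)$ is bounded,
$Q(|f - g| \le \eta) \le c_2\eta^\alpha$ for $0 < \eta \le \eta_0$ implies $Q(|f - g| \le \eta) \le \tilde c_2\eta^\alpha$, $\forall \eta > 0$, where
$\tilde c_2 := \tilde c_2(\alpha, c_2, \eta_0, Q(K))$. Then we have, since $h \ge 0$ is bounded, choosing $\eta = \left(\frac{\int h\,dQ}{2B\tilde c_2}\right)^{1/\alpha}$:
\[
\begin{aligned}
\int |f-g|\,h\,dQ &\ge \int |f-g|\,\mathbf{1}(|f-g| > \eta)\,h\,dQ \\
&\ge \eta\left[\int h\,dQ - \int h\,\mathbf{1}(|f-g| \le \eta)\,dQ\right] \\
&\ge \eta\left[\int h\,dQ - B\,Q(|f-g| \le \eta)\right] \\
&\ge \eta\left(\int h\,dQ - \tilde c_2 B\eta^\alpha\right) \\
&= c(\alpha)\left(\int h\,dQ\right)^{\frac{\alpha+1}{\alpha}},
\end{aligned}
\]
where $c(\alpha) = 2^{-1-1/\alpha}(B\tilde c_2)^{-1/\alpha}$. The upper bound is straightforward since $|f - g|$ is bounded
from above.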

□ 
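The two-sided inequality of Lemma 4 can be sanity-checked on a toy example. Everything in the sketch below is an assumption chosen for illustration: $Q$ is the Lebesgue measure on $[0,1]$, $f(x) = 2x$ and $g(x) = 2(1-x)$, so that $|f-g| = |4x-2|$ and $Q(|f-g| \le \eta) \le \eta/2$ for all $\eta > 0$ (margin parameter $\alpha = 1$ with constant $1/2$), and $h$ is the indicator of an interval of length $t$ centred at $1/2$, so that $B = 1$.

```python
import numpy as np

# Toy setting (all choices are illustrative assumptions, not the paper's):
# alpha = 1, margin constant 1/2, B = 1, and C(alpha) = sup |f - g| = 2.
alpha, c2, B = 1.0, 0.5, 1.0
c_alpha = 2.0 ** (-1.0 - 1.0 / alpha) * (B * c2) ** (-1.0 / alpha)
C_alpha = 2.0

n = 200_000
x = (np.arange(n) + 0.5) / n        # midpoint grid on [0, 1]
gap = np.abs(4.0 * x - 2.0)         # |f - g| for f = 2x, g = 2(1 - x)

def check(t):
    """Return (lower bound, int |f-g| h dQ, upper bound) for h = 1{|x - 1/2| <= t/2}."""
    h = (np.abs(x - 0.5) <= t / 2.0).astype(float)
    int_h = h.mean()                 # midpoint rule for int h dQ
    int_gap_h = (gap * h).mean()     # midpoint rule for int |f-g| h dQ
    lower = c_alpha * int_h ** ((alpha + 1.0) / alpha)
    upper = C_alpha * int_h
    return lower, int_gap_h, upper

results = [check(t) for t in (0.05, 0.2, 0.5, 1.0)]
for lower, middle, upper in results:
    print(lower, middle, upper)
```

In this example the middle quantity is close to $t^2$, which sits between $t^2/2$ and $2t$ for every $t \in (0, 1]$, in line with the exponent $(\alpha+1)/\alpha = 2$ of the lower bound.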